2012-04-19 14:05:58

by David Howells

[permalink] [raw]
Subject: [PATCH 0/6] Extended file stat system call


Implement a pair of new system calls to provide extended and further extensible
stat functions.

The second of the associated patches is the main patch that provides these new
system calls:

ssize_t ret = xstat(int dfd,
const char *filename,
unsigned atflag,
unsigned mask,
struct xstat *buffer);

ssize_t ret = fxstat(int fd,
unsigned atflag,
unsigned mask,
struct xstat *buffer);

which are more fully documented in the first patch's description.

These new stat functions provide a number of useful features, in summary:

(1) More information: creation time, inode generation number, data version
number, flags/attributes. A subset of these is available through a number
of filesystems (such as CIFS, NFS, AFS, Ext4 and BTRFS).

(2) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server.

(3) Heavyweight stat: Force a netfs to go to the server, even if it thinks its
cached attributes are up to date.

(4) Allow the filesystem to indicate what it can/cannot provide: A filesystem
can now say it doesn't support a standard stat feature if that isn't
available.

(5) Make the fields a consistent size on all arches, and make them large.

(6) Can be extended by using more request flags and appending further data
after the end of the standard return data.

Note that no lstat() equivalent is required as that can be implemented through
xstat() with atflag == 0.


=======
PATCHES
=======

Patch 1 defines the xstat() and fxstat() system calls.

Patches 2-6 implement extended stat facilities for Ext4, AFS, NFS and CIFS, and
make eCryptFS go to the lower filesystem for such details.


==============
CONSIDERATIONS
==============

Should fxstat() be implemented as xstat() with a NULL filename, using dfd as
fd?

Should the default for a network fs be to do an unconditional (heavyweight)
stat with a flag to suppress going to the server to update the locally held
attributes and flushing pending writebacks?

Should things like the Windows Archive, Hidden and System bits be handled
through IOC flags, perhaps expanded to 64-bits?

Are these things useful to userspace other than Samba and userspace NFS
servers?

Is it useful to pass the volume ID out? Or is statfs() sufficient for this?

Should I add a sixth argument to xstat(), mark it reserved and require that
must be supplied as 0 to hedge against future use?

Is there anything else I can usefully add at the moment?


==========
TO BE DONE
==========

Autofs, ntfs, btrfs, ...

I should perhaps use u8/u32/u64 rather than uint8/32/64_t.

Handle remote filesystems being offline and indicate this with
XSTAT_INFO_OFFLINE.


=======
TESTING
=======

There's a test program attached to the description for the main patch. It can
be run as follows:
[root@andromeda tmp]# ./xstat -R /mnt/foo

xstat(/mnt/foo) = 0
0000: 000081a40000ffef 0000000000000001 0000020000000000 0000100000080000
0020: 0000000000000000 0000000600000008 000000004f88499a 0000000136fd9208
0040: 000000004f88499a 0000000136fd9208 000000004f8849b9 0000000106daf187
0060: 000000004f8849b9 0000000106daf187 000000000000000c 000000000000000f
0080: 0000000000000008 00000000484ebbef 0000000000000025 5949ebd4711efd82
00a0: d3250b5c15d5e380 0000000000000000 0000000000000000 0000000000000000
00c0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
00e0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
results=ffef
Size: 15 Blocks: 8 IO Block: 4096 regular file
Device: 08:06 Inode: 12 Links: 1
Access: (0644/-rw-r--r--) Uid: 0
Gid: 0
Access: 2012-04-13 16:43:22.922587656+0100
Modify: 2012-04-13 16:43:53.115011975+0100
Change: 2012-04-13 16:43:53.115011975+0100
Create: 2012-04-13 16:43:22.922587656+0100
Inode version: 484ebbefh
Data version: 25h
Inode flags: 00080000 (-------- ----e--- -------- --------)
Information: 00000200 (-------- -------- ------a- --------)
Volume ID: 82fd1e71d4eb4959-80e3d5155c0b25d3

David
---
David Howells (6):
xstat: eCryptFS: Return extended attributes
xstat: CIFS: Return extended attributes
xstat: NFS: Return extended attributes
xstat: AFS: Return extended attributes
xstat: Ext4: Return extended attributes
xstat: Add a pair of system calls to make extended file stats available


arch/x86/syscalls/syscall_32.tbl | 2
arch/x86/syscalls/syscall_64.tbl | 2
fs/afs/inode.c | 29 ++-
fs/afs/super.c | 7 +
fs/cifs/cifsfs.h | 4
fs/cifs/cifsglob.h | 16 +-
fs/cifs/dir.c | 2
fs/cifs/inode.c | 120 +++++++++++--
fs/ecryptfs/inode.c | 14 +-
fs/ext4/ext4.h | 2
fs/ext4/file.c | 2
fs/ext4/inode.c | 32 +++
fs/ext4/namei.c | 2
fs/ext4/super.c | 1
fs/ext4/symlink.c | 2
fs/nfs/inode.c | 49 ++++-
fs/nfs/super.c | 1
fs/stat.c | 350 +++++++++++++++++++++++++++++++++++---
include/linux/fcntl.h | 1
include/linux/fs.h | 4
include/linux/stat.h | 126 +++++++++++++-
include/linux/syscalls.h | 7 +
22 files changed, 694 insertions(+), 81 deletions(-)


2012-04-19 14:07:26

by David Howells

[permalink] [raw]
Subject: [PATCH 3/6] xstat: AFS: Return extended attributes

Return extended attributes from the AFS filesystem. This includes the
following:

(1) The vnode uniquifier as st_gen.

(2) The data version number as st_data_version.

(3) XSTAT_INFO_AUTOMOUNT will be set on automount directories by virtue of
S_AUTOMOUNT being set on the inode. These are referrals to other volumes
or other cells.

(4) XSTAT_INFO_AUTODIR on a directory that does cell lookup for non-existent
names and mounts them (typically mounted on /afs with -o autocell). The
resulting directories are marked XSTAT_INFO_FABRICATED as they do not
actually exist in the mounted AFS directory.

(6) Files, directories and symlinks accessed over AFS are marked
XSTAT_INFO_REMOTE.

(7) XSTAT_INFO_NONSYSTEM_OWNERSHIP is set as the UID and GID retrieved from an
AFS share may not be applicable on the system.

(8) XSTAT_INFO_HAS_ACL is set as AFS directories have ACLs (the UID and GID
are only used through the ACLs) and these ACLs apply to the contents of
the directories.

Signed-off-by: David Howells <[email protected]>
---

fs/afs/inode.c | 29 +++++++++++++++++++++--------
fs/afs/super.c | 7 +++++++
2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index d890ae3..062def2 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -71,9 +71,9 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
inode->i_uid = vnode->status.owner;
inode->i_gid = 0;
inode->i_size = vnode->status.size;
- inode->i_ctime.tv_sec = vnode->status.mtime_server;
- inode->i_ctime.tv_nsec = 0;
- inode->i_atime = inode->i_mtime = inode->i_ctime;
+ inode->i_mtime.tv_sec = vnode->status.mtime_server;
+ inode->i_mtime.tv_nsec = 0;
+ inode->i_atime = inode->i_ctime = inode->i_mtime;
inode->i_blocks = 0;
inode->i_generation = vnode->fid.unique;
inode->i_version = vnode->status.data_version;
@@ -374,16 +374,29 @@ error_unlock:
/*
* read the attributes of an inode
*/
-int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,
- struct kstat *stat)
+int afs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
- struct inode *inode;
-
- inode = dentry->d_inode;
+ struct inode *inode = dentry->d_inode;

_enter("{ ino=%lu v=%u }", inode->i_ino, inode->i_generation);

generic_fillattr(inode, stat);
+
+ stat->result_mask &= ~(XSTAT_ATIME | XSTAT_CTIME | XSTAT_BLOCKS);
+ stat->result_mask |= XSTAT_GEN | XSTAT_VERSION;
+ stat->gen = inode->i_generation;
+ stat->version = inode->i_version;
+
+ if (test_bit(AFS_VNODE_AUTOCELL, &AFS_FS_I(inode)->flags))
+ stat->information |= XSTAT_INFO_AUTODIR;
+
+ if (test_bit(AFS_VNODE_PSEUDODIR, &AFS_FS_I(inode)->flags))
+ stat->information |= XSTAT_INFO_FABRICATED;
+ else
+ stat->information |= XSTAT_INFO_REMOTE;
+
+ stat->information |=
+ XSTAT_INFO_NONSYSTEM_OWNERSHIP | XSTAT_INFO_HAS_ACL;
return 0;
}

diff --git a/fs/afs/super.c b/fs/afs/super.c
index f02b31e..1f13b48 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -314,6 +314,13 @@ static int afs_fill_super(struct super_block *sb,
sb->s_bdi = &as->volume->bdi;
strlcpy(sb->s_id, as->volume->vlocation->vldb.name, sizeof(sb->s_id));

+ /* construct a volume ID from the AFS volume ID and type */
+ sb->s_volume_id[4] = as->volume->type;
+ sb->s_volume_id[3] = as->volume->vid >> 0;
+ sb->s_volume_id[2] = as->volume->vid >> 8;
+ sb->s_volume_id[1] = as->volume->vid >> 16;
+ sb->s_volume_id[0] = as->volume->vid >> 24;
+
/* allocate the root inode and dentry */
fid.vid = as->volume->vid;
fid.vnode = 1;


2012-04-19 14:06:53

by David Howells

[permalink] [raw]
Subject: [PATCH 4/6] xstat: NFS: Return extended attributes

Return extended attributes from the NFS filesystem. This includes the
following:

(1) The change attribute as st_data_version if NFSv4.

(2) XSTAT_INFO_AUTOMOUNT and XSTAT_INFO_FABRICATED are set on referral or
submount directories that are automounted upon. NFS shows one directory
with a different FSID, but the local filesystem has two: the mountpoint
directory and the root of the filesystem mounted upon it.

(3) XSTAT_INFO_REMOTE is set on files acquired over NFS.

Furthermore, what nfs_getattr() does can be controlled as follows:

(1) If AT_FORCE_ATTR_SYNC is indicated, or mtime, ctime or data_version (NFSv4
only) are requested then the outstanding writes will be written to the
server first.

(2) The inode's attributes may be synchronised with the server:

(a) If AT_FORCE_ATTR_SYNC is indicated or if atime is requested (and atime
updating is not suppressed by a mount flag) then the attributes will
be reread unconditionally.

(b) If the data version or any of basic stats are requested then the
attributes will be reread if the cached attributes have expired.

(c) Otherwise the cached attributes will be used - even if expired -
without reference to the server.

Signed-off-by: David Howells <[email protected]>
---

fs/nfs/inode.c | 49 +++++++++++++++++++++++++++++++++++++------------
fs/nfs/super.c | 1 +
2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e8bbfa5..460fcf3 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -509,11 +509,18 @@ void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr)
int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
struct inode *inode = dentry->d_inode;
+ unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
int err;

- /* Flush out writes to the server in order to update c/mtime. */
- if (S_ISREG(inode->i_mode)) {
+ if (NFS_SERVER(inode)->nfs_client->rpc_ops->version < 4)
+ stat->request_mask &= ~XSTAT_VERSION;
+
+ /* Flush out writes to the server in order to update c/mtime
+ * or data version if the user wants them */
+ if ((force || (stat->request_mask &
+ (XSTAT_MTIME | XSTAT_CTIME | XSTAT_VERSION))) &&
+ S_ISREG(inode->i_mode)) {
err = filemap_write_and_wait(inode->i_mapping);
if (err)
goto out;
@@ -528,18 +535,36 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
* - NFS never sets MS_NOATIME or MS_NODIRATIME so there is
* no point in checking those.
*/
- if ((mnt->mnt_flags & MNT_NOATIME) ||
- ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
+ if (mnt->mnt_flags & MNT_NOATIME ||
+ (mnt->mnt_flags & MNT_NODIRATIME && S_ISDIR(inode->i_mode))) {
+ stat->ioc_flags |= FS_NOATIME_FL;
+ need_atime = 0;
+ } else if (!(stat->request_mask & XSTAT_ATIME)) {
need_atime = 0;
+ }

- if (need_atime)
- err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
- else
- err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
- if (!err) {
- generic_fillattr(inode, stat);
- stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+ if (force || stat->request_mask & (XSTAT_BASIC_STATS | XSTAT_VERSION)) {
+ if (force || need_atime)
+ err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ else
+ err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ if (err)
+ goto out;
}
+
+ generic_fillattr(inode, stat);
+ stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+
+ if (stat->request_mask & XSTAT_VERSION) {
+ stat->version = inode->i_version;
+ stat->result_mask |= XSTAT_VERSION;
+ }
+
+ if (IS_AUTOMOUNT(inode))
+ stat->information |= XSTAT_INFO_FABRICATED;
+
+ stat->information |= XSTAT_INFO_REMOTE;
+
out:
return err;
}
@@ -852,7 +877,7 @@ int nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
static int nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping)
{
struct nfs_inode *nfsi = NFS_I(inode);
-
+
if (mapping->nrpages != 0) {
int ret = invalidate_inode_pages2(mapping);
if (ret < 0)
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 37412f7..faa652c 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2222,6 +2222,7 @@ static int nfs_set_super(struct super_block *s, void *data)
ret = set_anon_super(s, server);
if (ret == 0)
server->s_dev = s->s_dev;
+ memcpy(&s->s_volume_id, &server->fsid, sizeof(s->s_volume_id));
return ret;
}



2012-04-19 14:23:34

by David Howells

[permalink] [raw]
Subject: [PATCH 5/6] xstat: CIFS: Return extended attributes

Return extended attributes from the CIFS filesystem. This includes the
following:

(1) Return the file creation time as btime. We assume that the creation time
won't change over the life of the inode.

(2) Set XSTAT_INFO_AUTOMOUNT on referral/submount directories.

(3) Unset XSTAT_INO if we made up the inode number and didn't get it from the
server.

(4) Unset XSTAT_[UG]ID if we are either returning values passed to mount
and/or the server doesn't return them.

(5) Map various Windows file attributes to FS_xxx_FL flags in st_ioc_flags
and XSTAT_INFO_xxx flags in st_information, fetching them from the server
if we don't have them yet or don't have a current copy.

Possibly things like Hidden, System and Archive should be FS_xxx_FL flags
rather than XSTAT_INFO_xxx flags and st_ioc_flags should be expanded to
64 bits.

(6) Set XSTAT_INFO_REMOTE on all files fetched by CIFS.

(7) Set XSTAT_INFO_NONSYSTEM_OWNERSHIP on all files as they all have Windows
ownership details too.

(8) Set XSTAT_INFO_HAS_ACL if CONFIG_CIFS_ACL=y as Windows ACLs are available
on the object.

Furthermore, what cifs_getattr() does can be controlled as follows:

(1) If AT_FORCE_ATTR_SYNC is indicated, or if the inode flags or creation time
are requested but not yet collected, then the attributes will be reread
unconditionally.

(2) If the basic stats are requested or if the inode flags are requested and
have been collected previously, then the attributes will be reread if out
of date.

(3) Otherwise the cached attributes will be used - even if expired - without
reference to the server.

Note that cifs_revalidate_dentry() will issue an extra operation to get the
FILE_ALL_INFO in addition to the FILE_UNIX_BASIC_INFO if it needs to collect
creation time and attributes on behalf of cifs_getattr().

[NOTE: THIS PATCH IS UNTESTED!]

Signed-off-by: David Howells <[email protected]>
---

fs/cifs/cifsfs.h | 4 +-
fs/cifs/cifsglob.h | 16 +++++--
fs/cifs/dir.c | 2 -
fs/cifs/inode.c | 120 +++++++++++++++++++++++++++++++++++++++++++++-------
4 files changed, 118 insertions(+), 24 deletions(-)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index d1389bb..021e327 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -56,9 +56,9 @@ extern int cifs_rmdir(struct inode *, struct dentry *);
extern int cifs_rename(struct inode *, struct dentry *, struct inode *,
struct dentry *);
extern int cifs_revalidate_file_attr(struct file *filp);
-extern int cifs_revalidate_dentry_attr(struct dentry *);
+extern int cifs_revalidate_dentry_attr(struct dentry *, bool, bool);
extern int cifs_revalidate_file(struct file *filp);
-extern int cifs_revalidate_dentry(struct dentry *);
+extern int cifs_revalidate_dentry(struct dentry *, bool, bool);
extern int cifs_invalidate_mapping(struct inode *inode);
extern int cifs_getattr(struct vfsmount *, struct dentry *, struct kstat *);
extern int cifs_setattr(struct dentry *, struct iattr *);
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 4ff6313..d3567da 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -621,11 +621,15 @@ struct cifsInodeInfo {
/* BB add in lists for dirty pages i.e. write caching info for oplock */
struct list_head openFileList;
__u32 cifsAttrs; /* e.g. DOS archive bit, sparse, compressed, system */
- bool clientCanCacheRead; /* read oplock */
- bool clientCanCacheAll; /* read and writebehind oplock */
- bool delete_pending; /* DELETE_ON_CLOSE is set */
- bool invalid_mapping; /* pagecache is invalid */
+ bool clientCanCacheRead:1; /* read oplock */
+ bool clientCanCacheAll:1; /* read and writebehind oplock */
+ bool delete_pending:1; /* DELETE_ON_CLOSE is set */
+ bool invalid_mapping:1; /* pagecache is invalid */
+ bool btime_valid:1; /* stored creation time is valid */
+ bool uid_faked:1; /* true if i_uid is faked */
+ bool gid_faked:1; /* true if i_gid is faked */
unsigned long time; /* jiffies of last update of inode */
+ struct timespec btime; /* creation time */
u64 server_eof; /* current file size on server -- protected by i_lock */
u64 uniqueid; /* server inode number */
u64 createtime; /* creation time on server */
@@ -833,6 +837,9 @@ struct dfs_info3_param {
#define CIFS_FATTR_DELETE_PENDING 0x2
#define CIFS_FATTR_NEED_REVAL 0x4
#define CIFS_FATTR_INO_COLLISION 0x8
+#define CIFS_FATTR_WINATTRS_VALID 0x10 /* T if cf_btime and cf_cifsattrs valid */
+#define CIFS_FATTR_UID_FAKED 0x20 /* T if cf_uid is faked */
+#define CIFS_FATTR_GID_FAKED 0x40 /* T if cf_gid is faked */

struct cifs_fattr {
u32 cf_flags;
@@ -850,6 +857,7 @@ struct cifs_fattr {
struct timespec cf_atime;
struct timespec cf_mtime;
struct timespec cf_ctime;
+ struct timespec cf_btime;
};

static inline void free_dfs_info_param(struct dfs_info3_param *param)
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index d172c8e..d9e03ae 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -664,7 +664,7 @@ cifs_d_revalidate(struct dentry *direntry, struct nameidata *nd)
return -ECHILD;

if (direntry->d_inode) {
- if (cifs_revalidate_dentry(direntry))
+ if (cifs_revalidate_dentry(direntry, false, false))
return 0;
else {
/*
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 745da3d..662d5ce 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -135,13 +135,21 @@ cifs_fattr_to_inode(struct inode *inode, struct cifs_fattr *fattr)
set_nlink(inode, fattr->cf_nlink);
inode->i_uid = fattr->cf_uid;
inode->i_gid = fattr->cf_gid;
+ if (fattr->cf_flags & CIFS_FATTR_UID_FAKED)
+ cifs_i->uid_faked = true;
+ if (fattr->cf_flags & CIFS_FATTR_GID_FAKED)
+ cifs_i->gid_faked = true;

/* if dynperm is set, don't clobber existing mode */
if (inode->i_state & I_NEW ||
!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_DYNPERM))
inode->i_mode = fattr->cf_mode;

- cifs_i->cifsAttrs = fattr->cf_cifsattrs;
+ if (fattr->cf_flags & CIFS_FATTR_WINATTRS_VALID) {
+ cifs_i->cifsAttrs = fattr->cf_cifsattrs;
+ cifs_i->btime = fattr->cf_btime;
+ cifs_i->btime_valid = true;
+ }

if (fattr->cf_flags & CIFS_FATTR_NEED_REVAL)
cifs_i->time = 0;
@@ -248,15 +256,19 @@ cifs_unix_basic_to_fattr(struct cifs_fattr *fattr, FILE_UNIX_BASIC_INFO *info,
break;
}

- if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_UID)
+ if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_UID) {
fattr->cf_uid = cifs_sb->mnt_uid;
- else
+ fattr->cf_flags |= CIFS_FATTR_UID_FAKED;
+ } else {
fattr->cf_uid = le64_to_cpu(info->Uid);
+ }

- if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID)
+ if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID) {
fattr->cf_gid = cifs_sb->mnt_gid;
- else
+ fattr->cf_flags |= CIFS_FATTR_GID_FAKED;
+ } else {
fattr->cf_gid = le64_to_cpu(info->Gid);
+ }

fattr->cf_nlink = le64_to_cpu(info->Nlinks);
}
@@ -283,7 +295,8 @@ cifs_create_dfs_fattr(struct cifs_fattr *fattr, struct super_block *sb)
fattr->cf_ctime = CURRENT_TIME;
fattr->cf_mtime = CURRENT_TIME;
fattr->cf_nlink = 2;
- fattr->cf_flags |= CIFS_FATTR_DFS_REFERRAL;
+ fattr->cf_flags |= CIFS_FATTR_DFS_REFERRAL |
+ CIFS_FATTR_UID_FAKED | CIFS_FATTR_GID_FAKED;
}

int cifs_get_file_info_unix(struct file *filp)
@@ -510,6 +523,7 @@ cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,
struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);

memset(fattr, 0, sizeof(*fattr));
+ fattr->cf_flags = CIFS_FATTR_WINATTRS_VALID;
fattr->cf_cifsattrs = le32_to_cpu(info->Attributes);
if (info->DeletePending)
fattr->cf_flags |= CIFS_FATTR_DELETE_PENDING;
@@ -521,6 +535,7 @@ cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,

fattr->cf_ctime = cifs_NTtimeToUnix(info->ChangeTime);
fattr->cf_mtime = cifs_NTtimeToUnix(info->LastWriteTime);
+ fattr->cf_btime = cifs_NTtimeToUnix(info->CreationTime);

if (adjust_tz) {
fattr->cf_ctime.tv_sec += tcon->ses->server->timeAdj;
@@ -1724,7 +1739,8 @@ int cifs_revalidate_file_attr(struct file *filp)
return rc;
}

-int cifs_revalidate_dentry_attr(struct dentry *dentry)
+int cifs_revalidate_dentry_attr(struct dentry *dentry,
+ bool want_extra_bits, bool force)
{
int xid;
int rc = 0;
@@ -1735,7 +1751,7 @@ int cifs_revalidate_dentry_attr(struct dentry *dentry)
if (inode == NULL)
return -ENOENT;

- if (!cifs_inode_needs_reval(inode))
+ if (!force && !cifs_inode_needs_reval(inode))
return rc;

xid = GetXid();
@@ -1752,9 +1768,12 @@ int cifs_revalidate_dentry_attr(struct dentry *dentry)
"%ld jiffies %ld", full_path, inode, inode->i_count.counter,
dentry, dentry->d_time, jiffies);

- if (cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext)
+ if (cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext) {
rc = cifs_get_inode_info_unix(&inode, full_path, sb, xid);
- else
+ if (rc != 0)
+ goto out;
+ }
+ if (!cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext || want_extra_bits)
rc = cifs_get_inode_info(&inode, full_path, NULL, sb,
xid, NULL);

@@ -1779,12 +1798,13 @@ int cifs_revalidate_file(struct file *filp)
}

/* revalidate a dentry's inode attributes */
-int cifs_revalidate_dentry(struct dentry *dentry)
+int cifs_revalidate_dentry(struct dentry *dentry,
+ bool want_extra_bits, bool force)
{
int rc;
struct inode *inode = dentry->d_inode;

- rc = cifs_revalidate_dentry_attr(dentry);
+ rc = cifs_revalidate_dentry_attr(dentry, want_extra_bits, force);
if (rc)
return rc;

@@ -1796,11 +1816,30 @@ int cifs_revalidate_dentry(struct dentry *dentry)
int cifs_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat)
{
+ struct cifsInodeInfo *cifs_i = CIFS_I(dentry->d_inode);
struct cifs_sb_info *cifs_sb = CIFS_SB(dentry->d_sb);
struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
struct inode *inode = dentry->d_inode;
+ unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
+ bool want_extra_bits = false;
+ u32 info, ioc = 0;
+ u32 attrs;
int rc;

+ if (cifs_i->uid_faked)
+ stat->request_mask &= ~XSTAT_UID;
+ if (cifs_i->gid_faked)
+ stat->request_mask &= ~XSTAT_GID;
+
+ if (stat->request_mask & XSTAT_BTIME && !cifs_i->btime_valid) {
+ want_extra_bits = true;
+ force = true;
+ }
+ if (stat->request_mask & XSTAT_IOC_FLAGS) {
+ want_extra_bits = true;
+ force = true;
+ }
+
/*
* We need to be sure that all dirty pages are written and the server
* has actual ctime, mtime and file length.
@@ -1814,13 +1853,14 @@ int cifs_getattr(struct vfsmount *mnt, struct dentry *dentry,
}
}

- rc = cifs_revalidate_dentry_attr(dentry);
- if (rc)
- return rc;
+ if (force || stat->request_mask & XSTAT_BASIC_STATS) {
+ rc = cifs_revalidate_dentry(dentry, want_extra_bits, force);
+ if (rc)
+ return rc;
+ }

generic_fillattr(inode, stat);
stat->blksize = CIFS_MAX_MSGSIZE;
- stat->ino = CIFS_I(inode)->uniqueid;

/*
* If on a multiuser mount without unix extensions, and the admin hasn't
@@ -1834,7 +1874,53 @@ int cifs_getattr(struct vfsmount *mnt, struct dentry *dentry,
if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID))
stat->gid = current_fsgid();
}
- return rc;
+
+ info = XSTAT_INFO_REMOTE | XSTAT_INFO_NONSYSTEM_OWNERSHIP;
+#ifdef CONFIG_CIFS_ACL
+ info |= XSTAT_INFO_HAS_ACL;
+#endif
+
+ if (cifs_i->btime_valid) {
+ stat->btime = cifs_i->btime;
+ stat->result_mask |= XSTAT_BTIME;
+ }
+
+ /* We don't promise an inode number if we made one up */
+ stat->ino = cifs_i->uniqueid;
+ if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SERVER_INUM))
+ stat->result_mask &= ~XSTAT_INO;
+
+ /*
+ * If on a multiuser mount without unix extensions, and the admin
+ * hasn't overridden them, set the ownership to the fsuid/fsgid of the
+ * current process.
+ */
+ if ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MULTIUSER) &&
+ !tcon->unix_ext) {
+ if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_UID))
+ stat->uid = current_fsuid();
+ if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID))
+ stat->gid = current_fsgid();
+ }
+ if (cifs_i->uid_faked)
+ stat->result_mask &= ~XSTAT_UID;
+ if (cifs_i->gid_faked)
+ stat->result_mask &= ~XSTAT_GID;
+
+ attrs = cifs_i->cifsAttrs;
+ if (attrs & ATTR_HIDDEN) info |= XSTAT_INFO_HIDDEN;
+ if (attrs & ATTR_SYSTEM) info |= XSTAT_INFO_SYSTEM;
+ if (attrs & ATTR_ARCHIVE) info |= XSTAT_INFO_ARCHIVE;
+ if (attrs & ATTR_TEMPORARY) info |= XSTAT_INFO_TEMPORARY;
+ if (attrs & ATTR_REPARSE) info |= XSTAT_INFO_REPARSE_POINT;
+ if (attrs & ATTR_OFFLINE) info |= XSTAT_INFO_OFFLINE;
+ if (attrs & ATTR_ENCRYPTED) info |= XSTAT_INFO_ENCRYPTED;
+ stat->information |= info;
+
+ if (attrs & ATTR_READONLY) ioc |= FS_IMMUTABLE_FL;
+ if (attrs & ATTR_COMPRESSED) ioc |= FS_COMPR_FL;
+ stat->ioc_flags |= ioc;
+ return 0;
}

static int cifs_truncate_page(struct address_space *mapping, loff_t from)


2012-04-19 14:35:50

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 4/6] xstat: NFS: Return extended attributes

On Thu, 2012-04-19 at 15:06 +0100, David Howells wrote:
> Return extended attributes from the NFS filesystem. This includes the
> following:
>
> (1) The change attribute as st_data_version if NFSv4.
>
> (2) XSTAT_INFO_AUTOMOUNT and XSTAT_INFO_FABRICATED are set on referral or
> submount directories that are automounted upon. NFS shows one directory
> with a different FSID, but the local filesystem has two: the mountpoint
> directory and the root of the filesystem mounted upon it.
>
> (3) XSTAT_INFO_REMOTE is set on files acquired over NFS.
>
> Furthermore, what nfs_getattr() does can be controlled as follows:
>
> (1) If AT_FORCE_ATTR_SYNC is indicated, or mtime, ctime or data_version (NFSv4
> only) are requested then the outstanding writes will be written to the
> server first.
>
> (2) The inode's attributes may be synchronised with the server:
>
> (a) If AT_FORCE_ATTR_SYNC is indicated or if atime is requested (and atime
> updating is not suppressed by a mount flag) then the attributes will
> be reread unconditionally.
>
> (b) If the data version or any of basic stats are requested then the
> attributes will be reread if the cached attributes have expired.
>
> (c) Otherwise the cached attributes will be used - even if expired -
> without reference to the server.

Hmm... As far as I can see you are still doing an nfs_revalidate_inode()
in the non-forced case. That will cause expired attributes to be
retrieved from the server.

> Signed-off-by: David Howells <[email protected]>
> ---
>
> fs/nfs/inode.c | 49 +++++++++++++++++++++++++++++++++++++------------
> fs/nfs/super.c | 1 +
> 2 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index e8bbfa5..460fcf3 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -509,11 +509,18 @@ void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr)
> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> {
> struct inode *inode = dentry->d_inode;
> + unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
> int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
> int err;
>
> - /* Flush out writes to the server in order to update c/mtime. */
> - if (S_ISREG(inode->i_mode)) {
> + if (NFS_SERVER(inode)->nfs_client->rpc_ops->version < 4)
> + stat->request_mask &= ~XSTAT_VERSION;
> +
> + /* Flush out writes to the server in order to update c/mtime
> + * or data version if the user wants them */
> + if ((force || (stat->request_mask &
> + (XSTAT_MTIME | XSTAT_CTIME | XSTAT_VERSION))) &&
> + S_ISREG(inode->i_mode)) {
> err = filemap_write_and_wait(inode->i_mapping);

We can get rid of the filemap_write_and_wait() if the caller allows us
to approximate m/ctime values. That would give a major speed-up for most
stat() workloads.

> if (err)
> goto out;
> @@ -528,18 +535,36 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> * - NFS never sets MS_NOATIME or MS_NODIRATIME so there is
> * no point in checking those.
> */
> - if ((mnt->mnt_flags & MNT_NOATIME) ||
> - ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)))
> + if (mnt->mnt_flags & MNT_NOATIME ||
> + (mnt->mnt_flags & MNT_NODIRATIME && S_ISDIR(inode->i_mode))) {
> + stat->ioc_flags |= FS_NOATIME_FL;
> + need_atime = 0;
> + } else if (!(stat->request_mask & XSTAT_ATIME)) {
> need_atime = 0;
> + }
>
> - if (need_atime)
> - err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
> - else
> - err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
> - if (!err) {
> - generic_fillattr(inode, stat);
> - stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
> + if (force || stat->request_mask & (XSTAT_BASIC_STATS | XSTAT_VERSION)) {
> + if (force || need_atime)
> + err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
> + else
> + err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
> + if (err)
> + goto out;
> }
> +
> + generic_fillattr(inode, stat);
> + stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
> +
> + if (stat->request_mask & XSTAT_VERSION) {
> + stat->version = inode->i_version;
> + stat->result_mask |= XSTAT_VERSION;
> + }
> +
> + if (IS_AUTOMOUNT(inode))
> + stat->information |= XSTAT_INFO_FABRICATED;
> +
> + stat->information |= XSTAT_INFO_REMOTE;
> +
> out:
> return err;
> }
> @@ -852,7 +877,7 @@ int nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
> static int nfs_invalidate_mapping(struct inode *inode, struct address_space *mapping)
> {
> struct nfs_inode *nfsi = NFS_I(inode);
> -
> +
> if (mapping->nrpages != 0) {
> int ret = invalidate_inode_pages2(mapping);
> if (ret < 0)
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 37412f7..faa652c 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -2222,6 +2222,7 @@ static int nfs_set_super(struct super_block *s, void *data)
> ret = set_anon_super(s, server);
> if (ret == 0)
> server->s_dev = s->s_dev;
> + memcpy(&s->s_volume_id, &server->fsid, sizeof(s->s_volume_id));
> return ret;
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

????{.n?+???????+%?????ݶ??w??{.n?+????{??w??)????w*jg????????ݢj???n?r??6;靫3??\鹹bs?&jw??? ???j:+v???w?j?m????????w?????f???h???????٥

2012-04-19 15:19:56

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 5/6] xstat: CIFS: Return extended attributes

For some of our users this would help A LOT.

Interesting ... just had discussions yesterday with some guys trying
to migrate to Linux and another set trying to backup Windows/NetApp
from Linux.

Some things they brought up that they needed (beyond what we already
have wth the cifs acl and SID and "dos attributes" xattrs, which is
even more useful now with the backup intent cifs mount flag) included:
- how do they tell if the inode number for a file was manufactured on
the client, or whether we were able to use the server file's inode
number ("UniqueId")
- how to get birth time (creation time)
- how to tell if file is "offline" (HSM)
- And is there a way to return the other less common cifs attributes
(e.g. "reparse")

Dave's patch seems to address all of that. Samba server stuffs most
of this in an ndr encoded xattr blob which isn't much good for kernel
code to use, and I really prefer Dave's approach. Without this, I
would need to add another cifs specific ioctl, but since there is
significant overlap between some of these and ntfs, vfat, nfs etc. I
like the xstat idea better.



On Thu, Apr 19, 2012 at 9:07 AM, David Howells <[email protected]> wrote:
> Return extended attributes from the CIFS filesystem. ?This includes the
> following:
>
> ?(1) Return the file creation time as btime. ?We assume that the creation time
> ? ? won't change over the life of the inode.
>
> ?(2) Set XSTAT_INFO_AUTOMOUNT on referral/submount directories.
>
> ?(3) Unset XSTAT_INO if we made up the inode number and didn't get it from the
> ? ? server.
>
> ?(4) Unset XSTAT_[UG]ID if we are either returning values passed to mount
> ? ? and/or the server doesn't return them.
>
> ?(5) Map various Windows file attributes to FS_xxx_FL flags in st_ioc_flags
> ? ? and XSTAT_INFO_xxx flags in st_information, fetching them from the server
> ? ? if we don't have them yet or don't have a current copy.
>
> ? ? Possibly things like Hidden, System and Archive should be FS_xxx_FL flags
> ? ? rather than XSTAT_INFO_xxx flags and st_ioc_flags should be expanded to
> ? ? 64 bits.
>
> ?(6) Set XSTAT_INFO_REMOTE on all files fetched by CIFS.
>
> ?(7) Set XSTAT_INFO_NONSYSTEM_OWNERSHIP on all files as they all have Windows
> ? ? ownership details too.
>
> ?(8) Set XSTAT_INFO_HAS_ACL if CONFIG_CIFS_ACL=y as Windows ACLs are available
> ? ? on the object.
>
> Furthermore, what cifs_getattr() does can be controlled as follows:
>
> ?(1) If AT_FORCE_ATTR_SYNC is indicated, or if the inode flags or creation time
> ? ? are requested but not yet collected, then the attributes will be reread
> ? ? unconditionally.
>
> ?(2) If the basic stats are requested or if the inode flags are requested and
> ? ? have been collected previously, then the attributes will be reread if out
> ? ? of date.
>
> ?(3) Otherwise the cached attributes will be used - even if expired - without
> ? ? reference to the server.
>
> Note that cifs_revalidate_dentry() will issue an extra operation to get the
> FILE_ALL_INFO in addition to the FILE_UNIX_BASIC_INFO if it needs to collect
> creation time and attributes on behalf of cifs_getattr().
>
> [NOTE: THIS PATCH IS UNTESTED!]
>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> ?fs/cifs/cifsfs.h ? | ? ?4 +-
> ?fs/cifs/cifsglob.h | ? 16 +++++--
> ?fs/cifs/dir.c ? ? ?| ? ?2 -
> ?fs/cifs/inode.c ? ?| ?120 +++++++++++++++++++++++++++++++++++++++++++++-------
> ?4 files changed, 118 insertions(+), 24 deletions(-)
>
> diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
> index d1389bb..021e327 100644
> --- a/fs/cifs/cifsfs.h
> +++ b/fs/cifs/cifsfs.h
> @@ -56,9 +56,9 @@ extern int cifs_rmdir(struct inode *, struct dentry *);
> ?extern int cifs_rename(struct inode *, struct dentry *, struct inode *,
> ? ? ? ? ? ? ? ? ? ? ? struct dentry *);
> ?extern int cifs_revalidate_file_attr(struct file *filp);
> -extern int cifs_revalidate_dentry_attr(struct dentry *);
> +extern int cifs_revalidate_dentry_attr(struct dentry *, bool, bool);
> ?extern int cifs_revalidate_file(struct file *filp);
> -extern int cifs_revalidate_dentry(struct dentry *);
> +extern int cifs_revalidate_dentry(struct dentry *, bool, bool);
> ?extern int cifs_invalidate_mapping(struct inode *inode);
> ?extern int cifs_getattr(struct vfsmount *, struct dentry *, struct kstat *);
> ?extern int cifs_setattr(struct dentry *, struct iattr *);
> diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
> index 4ff6313..d3567da 100644
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -621,11 +621,15 @@ struct cifsInodeInfo {
> ? ? ? ?/* BB add in lists for dirty pages i.e. write caching info for oplock */
> ? ? ? ?struct list_head openFileList;
> ? ? ? ?__u32 cifsAttrs; /* e.g. DOS archive bit, sparse, compressed, system */
> - ? ? ? bool clientCanCacheRead; ? ? ? ?/* read oplock */
> - ? ? ? bool clientCanCacheAll; ? ? ? ? /* read and writebehind oplock */
> - ? ? ? bool delete_pending; ? ? ? ? ? ?/* DELETE_ON_CLOSE is set */
> - ? ? ? bool invalid_mapping; ? ? ? ? ? /* pagecache is invalid */
> + ? ? ? bool clientCanCacheRead:1; ? ? ?/* read oplock */
> + ? ? ? bool clientCanCacheAll:1; ? ? ? /* read and writebehind oplock */
> + ? ? ? bool delete_pending:1; ? ? ? ? ?/* DELETE_ON_CLOSE is set */
> + ? ? ? bool invalid_mapping:1; ? ? ? ? /* pagecache is invalid */
> + ? ? ? bool btime_valid:1; ? ? ? ? ? ? /* stored creation time is valid */
> + ? ? ? bool uid_faked:1; ? ? ? ? ? ? ? /* true if i_uid is faked */
> + ? ? ? bool gid_faked:1; ? ? ? ? ? ? ? /* true if i_gid is faked */
> ? ? ? ?unsigned long time; ? ? ? ? ? ? /* jiffies of last update of inode */
> + ? ? ? struct timespec btime; ? ? ? ? ?/* creation time */
> ? ? ? ?u64 ?server_eof; ? ? ? ? ? ? ? ?/* current file size on server -- protected by i_lock */
> ? ? ? ?u64 ?uniqueid; ? ? ? ? ? ? ? ? ?/* server inode number */
> ? ? ? ?u64 ?createtime; ? ? ? ? ? ? ? ?/* creation time on server */
> @@ -833,6 +837,9 @@ struct dfs_info3_param {
> ?#define CIFS_FATTR_DELETE_PENDING ? ? ?0x2
> ?#define CIFS_FATTR_NEED_REVAL ? ? ? ? ?0x4
> ?#define CIFS_FATTR_INO_COLLISION ? ? ? 0x8
> +#define CIFS_FATTR_WINATTRS_VALID ? ? ?0x10 ? ?/* T if cf_btime and cf_cifsattrs valid */
> +#define CIFS_FATTR_UID_FAKED ? ? ? ? ? 0x20 ? ?/* T if cf_uid is faked */
> +#define CIFS_FATTR_GID_FAKED ? ? ? ? ? 0x40 ? ?/* T if cf_gid is faked */
>
> ?struct cifs_fattr {
> ? ? ? ?u32 ? ? ? ? ? ? cf_flags;
> @@ -850,6 +857,7 @@ struct cifs_fattr {
> ? ? ? ?struct timespec cf_atime;
> ? ? ? ?struct timespec cf_mtime;
> ? ? ? ?struct timespec cf_ctime;
> + ? ? ? struct timespec cf_btime;
> ?};
>
> ?static inline void free_dfs_info_param(struct dfs_info3_param *param)
> diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
> index d172c8e..d9e03ae 100644
> --- a/fs/cifs/dir.c
> +++ b/fs/cifs/dir.c
> @@ -664,7 +664,7 @@ cifs_d_revalidate(struct dentry *direntry, struct nameidata *nd)
> ? ? ? ? ? ? ? ?return -ECHILD;
>
> ? ? ? ?if (direntry->d_inode) {
> - ? ? ? ? ? ? ? if (cifs_revalidate_dentry(direntry))
> + ? ? ? ? ? ? ? if (cifs_revalidate_dentry(direntry, false, false))
> ? ? ? ? ? ? ? ? ? ? ? ?return 0;
> ? ? ? ? ? ? ? ?else {
> ? ? ? ? ? ? ? ? ? ? ? ?/*
> diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
> index 745da3d..662d5ce 100644
> --- a/fs/cifs/inode.c
> +++ b/fs/cifs/inode.c
> @@ -135,13 +135,21 @@ cifs_fattr_to_inode(struct inode *inode, struct cifs_fattr *fattr)
> ? ? ? ?set_nlink(inode, fattr->cf_nlink);
> ? ? ? ?inode->i_uid = fattr->cf_uid;
> ? ? ? ?inode->i_gid = fattr->cf_gid;
> + ? ? ? if (fattr->cf_flags & CIFS_FATTR_UID_FAKED)
> + ? ? ? ? ? ? ? cifs_i->uid_faked = true;
> + ? ? ? if (fattr->cf_flags & CIFS_FATTR_GID_FAKED)
> + ? ? ? ? ? ? ? cifs_i->gid_faked = true;
>
> ? ? ? ?/* if dynperm is set, don't clobber existing mode */
> ? ? ? ?if (inode->i_state & I_NEW ||
> ? ? ? ? ? ?!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_DYNPERM))
> ? ? ? ? ? ? ? ?inode->i_mode = fattr->cf_mode;
>
> - ? ? ? cifs_i->cifsAttrs = fattr->cf_cifsattrs;
> + ? ? ? if (fattr->cf_flags & CIFS_FATTR_WINATTRS_VALID) {
> + ? ? ? ? ? ? ? cifs_i->cifsAttrs = fattr->cf_cifsattrs;
> + ? ? ? ? ? ? ? cifs_i->btime = fattr->cf_btime;
> + ? ? ? ? ? ? ? cifs_i->btime_valid = true;
> + ? ? ? }
>
> ? ? ? ?if (fattr->cf_flags & CIFS_FATTR_NEED_REVAL)
> ? ? ? ? ? ? ? ?cifs_i->time = 0;
> @@ -248,15 +256,19 @@ cifs_unix_basic_to_fattr(struct cifs_fattr *fattr, FILE_UNIX_BASIC_INFO *info,
> ? ? ? ? ? ? ? ?break;
> ? ? ? ?}
>
> - ? ? ? if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_UID)
> + ? ? ? if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_UID) {
> ? ? ? ? ? ? ? ?fattr->cf_uid = cifs_sb->mnt_uid;
> - ? ? ? else
> + ? ? ? ? ? ? ? fattr->cf_flags |= CIFS_FATTR_UID_FAKED;
> + ? ? ? } else {
> ? ? ? ? ? ? ? ?fattr->cf_uid = le64_to_cpu(info->Uid);
> + ? ? ? }
>
> - ? ? ? if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID)
> + ? ? ? if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID) {
> ? ? ? ? ? ? ? ?fattr->cf_gid = cifs_sb->mnt_gid;
> - ? ? ? else
> + ? ? ? ? ? ? ? fattr->cf_flags |= CIFS_FATTR_GID_FAKED;
> + ? ? ? } else {
> ? ? ? ? ? ? ? ?fattr->cf_gid = le64_to_cpu(info->Gid);
> + ? ? ? }
>
> ? ? ? ?fattr->cf_nlink = le64_to_cpu(info->Nlinks);
> ?}
> @@ -283,7 +295,8 @@ cifs_create_dfs_fattr(struct cifs_fattr *fattr, struct super_block *sb)
> ? ? ? ?fattr->cf_ctime = CURRENT_TIME;
> ? ? ? ?fattr->cf_mtime = CURRENT_TIME;
> ? ? ? ?fattr->cf_nlink = 2;
> - ? ? ? fattr->cf_flags |= CIFS_FATTR_DFS_REFERRAL;
> + ? ? ? fattr->cf_flags |= CIFS_FATTR_DFS_REFERRAL |
> + ? ? ? ? ? ? ? CIFS_FATTR_UID_FAKED | CIFS_FATTR_GID_FAKED;
> ?}
>
> ?int cifs_get_file_info_unix(struct file *filp)
> @@ -510,6 +523,7 @@ cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,
> ? ? ? ?struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
>
> ? ? ? ?memset(fattr, 0, sizeof(*fattr));
> + ? ? ? fattr->cf_flags = CIFS_FATTR_WINATTRS_VALID;
> ? ? ? ?fattr->cf_cifsattrs = le32_to_cpu(info->Attributes);
> ? ? ? ?if (info->DeletePending)
> ? ? ? ? ? ? ? ?fattr->cf_flags |= CIFS_FATTR_DELETE_PENDING;
> @@ -521,6 +535,7 @@ cifs_all_info_to_fattr(struct cifs_fattr *fattr, FILE_ALL_INFO *info,
>
> ? ? ? ?fattr->cf_ctime = cifs_NTtimeToUnix(info->ChangeTime);
> ? ? ? ?fattr->cf_mtime = cifs_NTtimeToUnix(info->LastWriteTime);
> + ? ? ? fattr->cf_btime = cifs_NTtimeToUnix(info->CreationTime);
>
> ? ? ? ?if (adjust_tz) {
> ? ? ? ? ? ? ? ?fattr->cf_ctime.tv_sec += tcon->ses->server->timeAdj;
> @@ -1724,7 +1739,8 @@ int cifs_revalidate_file_attr(struct file *filp)
> ? ? ? ?return rc;
> ?}
>
> -int cifs_revalidate_dentry_attr(struct dentry *dentry)
> +int cifs_revalidate_dentry_attr(struct dentry *dentry,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? bool want_extra_bits, bool force)
> ?{
> ? ? ? ?int xid;
> ? ? ? ?int rc = 0;
> @@ -1735,7 +1751,7 @@ int cifs_revalidate_dentry_attr(struct dentry *dentry)
> ? ? ? ?if (inode == NULL)
> ? ? ? ? ? ? ? ?return -ENOENT;
>
> - ? ? ? if (!cifs_inode_needs_reval(inode))
> + ? ? ? if (!force && !cifs_inode_needs_reval(inode))
> ? ? ? ? ? ? ? ?return rc;
>
> ? ? ? ?xid = GetXid();
> @@ -1752,9 +1768,12 @@ int cifs_revalidate_dentry_attr(struct dentry *dentry)
> ? ? ? ? ? ? ? ? "%ld jiffies %ld", full_path, inode, inode->i_count.counter,
> ? ? ? ? ? ? ? ? dentry, dentry->d_time, jiffies);
>
> - ? ? ? if (cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext)
> + ? ? ? if (cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext) {
> ? ? ? ? ? ? ? ?rc = cifs_get_inode_info_unix(&inode, full_path, sb, xid);
> - ? ? ? else
> + ? ? ? ? ? ? ? if (rc != 0)
> + ? ? ? ? ? ? ? ? ? ? ? goto out;
> + ? ? ? }
> + ? ? ? if (!cifs_sb_master_tcon(CIFS_SB(sb))->unix_ext || want_extra_bits)
> ? ? ? ? ? ? ? ?rc = cifs_get_inode_info(&inode, full_path, NULL, sb,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? xid, NULL);
>
> @@ -1779,12 +1798,13 @@ int cifs_revalidate_file(struct file *filp)
> ?}
>
> ?/* revalidate a dentry's inode attributes */
> -int cifs_revalidate_dentry(struct dentry *dentry)
> +int cifs_revalidate_dentry(struct dentry *dentry,
> + ? ? ? ? ? ? ? ? ? ? ? ? ?bool want_extra_bits, bool force)
> ?{
> ? ? ? ?int rc;
> ? ? ? ?struct inode *inode = dentry->d_inode;
>
> - ? ? ? rc = cifs_revalidate_dentry_attr(dentry);
> + ? ? ? rc = cifs_revalidate_dentry_attr(dentry, want_extra_bits, force);
> ? ? ? ?if (rc)
> ? ? ? ? ? ? ? ?return rc;
>
> @@ -1796,11 +1816,30 @@ int cifs_revalidate_dentry(struct dentry *dentry)
> ?int cifs_getattr(struct vfsmount *mnt, struct dentry *dentry,
> ? ? ? ? ? ? ? ? struct kstat *stat)
> ?{
> + ? ? ? struct cifsInodeInfo *cifs_i = CIFS_I(dentry->d_inode);
> ? ? ? ?struct cifs_sb_info *cifs_sb = CIFS_SB(dentry->d_sb);
> ? ? ? ?struct cifs_tcon *tcon = cifs_sb_master_tcon(cifs_sb);
> ? ? ? ?struct inode *inode = dentry->d_inode;
> + ? ? ? unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
> + ? ? ? bool want_extra_bits = false;
> + ? ? ? u32 info, ioc = 0;
> + ? ? ? u32 attrs;
> ? ? ? ?int rc;
>
> + ? ? ? if (cifs_i->uid_faked)
> + ? ? ? ? ? ? ? stat->request_mask &= ~XSTAT_UID;
> + ? ? ? if (cifs_i->gid_faked)
> + ? ? ? ? ? ? ? stat->request_mask &= ~XSTAT_GID;
> +
> + ? ? ? if (stat->request_mask & XSTAT_BTIME && !cifs_i->btime_valid) {
> + ? ? ? ? ? ? ? want_extra_bits = true;
> + ? ? ? ? ? ? ? force = true;
> + ? ? ? }
> + ? ? ? if (stat->request_mask & XSTAT_IOC_FLAGS) {
> + ? ? ? ? ? ? ? want_extra_bits = true;
> + ? ? ? ? ? ? ? force = true;
> + ? ? ? }
> +
> ? ? ? ?/*
> ? ? ? ? * We need to be sure that all dirty pages are written and the server
> ? ? ? ? * has actual ctime, mtime and file length.
> @@ -1814,13 +1853,14 @@ int cifs_getattr(struct vfsmount *mnt, struct dentry *dentry,
> ? ? ? ? ? ? ? ?}
> ? ? ? ?}
>
> - ? ? ? rc = cifs_revalidate_dentry_attr(dentry);
> - ? ? ? if (rc)
> - ? ? ? ? ? ? ? return rc;
> + ? ? ? if (force || stat->request_mask & XSTAT_BASIC_STATS) {
> + ? ? ? ? ? ? ? rc = cifs_revalidate_dentry(dentry, want_extra_bits, force);
> + ? ? ? ? ? ? ? if (rc)
> + ? ? ? ? ? ? ? ? ? ? ? return rc;
> + ? ? ? }
>
> ? ? ? ?generic_fillattr(inode, stat);
> ? ? ? ?stat->blksize = CIFS_MAX_MSGSIZE;
> - ? ? ? stat->ino = CIFS_I(inode)->uniqueid;
>
> ? ? ? ?/*
> ? ? ? ? * If on a multiuser mount without unix extensions, and the admin hasn't
> @@ -1834,7 +1874,53 @@ int cifs_getattr(struct vfsmount *mnt, struct dentry *dentry,
> ? ? ? ? ? ? ? ?if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID))
> ? ? ? ? ? ? ? ? ? ? ? ?stat->gid = current_fsgid();
> ? ? ? ?}
> - ? ? ? return rc;
> +
> + ? ? ? info = XSTAT_INFO_REMOTE | XSTAT_INFO_NONSYSTEM_OWNERSHIP;
> +#ifdef CONFIG_CIFS_ACL
> + ? ? ? info |= XSTAT_INFO_HAS_ACL;
> +#endif
> +
> + ? ? ? if (cifs_i->btime_valid) {
> + ? ? ? ? ? ? ? stat->btime = cifs_i->btime;
> + ? ? ? ? ? ? ? stat->result_mask |= XSTAT_BTIME;
> + ? ? ? }
> +
> + ? ? ? /* We don't promise an inode number if we made one up */
> + ? ? ? stat->ino = cifs_i->uniqueid;
> + ? ? ? if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_SERVER_INUM))
> + ? ? ? ? ? ? ? stat->result_mask &= ~XSTAT_INO;
> +
> + ? ? ? /*
> + ? ? ? ?* If on a multiuser mount without unix extensions, and the admin
> + ? ? ? ?* hasn't overridden them, set the ownership to the fsuid/fsgid of the
> + ? ? ? ?* current process.
> + ? ? ? ?*/
> + ? ? ? if ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MULTIUSER) &&
> + ? ? ? ? ? !tcon->unix_ext) {
> + ? ? ? ? ? ? ? if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_UID))
> + ? ? ? ? ? ? ? ? ? ? ? stat->uid = current_fsuid();
> + ? ? ? ? ? ? ? if (!(cifs_sb->mnt_cifs_flags & CIFS_MOUNT_OVERR_GID))
> + ? ? ? ? ? ? ? ? ? ? ? stat->gid = current_fsgid();
> + ? ? ? }
> + ? ? ? if (cifs_i->uid_faked)
> + ? ? ? ? ? ? ? stat->result_mask &= ~XSTAT_UID;
> + ? ? ? if (cifs_i->gid_faked)
> + ? ? ? ? ? ? ? stat->result_mask &= ~XSTAT_GID;
> +
> + ? ? ? attrs = cifs_i->cifsAttrs;
> + ? ? ? if (attrs & ATTR_HIDDEN) ? ? ? ?info |= XSTAT_INFO_HIDDEN;
> + ? ? ? if (attrs & ATTR_SYSTEM) ? ? ? ?info |= XSTAT_INFO_SYSTEM;
> + ? ? ? if (attrs & ATTR_ARCHIVE) ? ? ? info |= XSTAT_INFO_ARCHIVE;
> + ? ? ? if (attrs & ATTR_TEMPORARY) ? ? info |= XSTAT_INFO_TEMPORARY;
> + ? ? ? if (attrs & ATTR_REPARSE) ? ? ? info |= XSTAT_INFO_REPARSE_POINT;
> + ? ? ? if (attrs & ATTR_OFFLINE) ? ? ? info |= XSTAT_INFO_OFFLINE;
> + ? ? ? if (attrs & ATTR_ENCRYPTED) ? ? info |= XSTAT_INFO_ENCRYPTED;
> + ? ? ? stat->information |= info;
> +
> + ? ? ? if (attrs & ATTR_READONLY) ? ? ?ioc |= FS_IMMUTABLE_FL;
> + ? ? ? if (attrs & ATTR_COMPRESSED) ? ?ioc |= FS_COMPR_FL;
> + ? ? ? stat->ioc_flags |= ioc;
> + ? ? ? return 0;
> ?}
>
> ?static int cifs_truncate_page(struct address_space *mapping, loff_t from)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html



--
Thanks,

Steve

2012-04-19 16:32:38

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

I have no comment on the functionality. But "xstat" is probably a poor
choice of name. There is precedent for that function name with different
meaning in the userland APIs. (It's a moderately useless meaning inherited
from SVR4, but regardless overloading a name previously used is unwise.)


Thanks,
Roland

2012-04-19 17:11:11

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 19, 2012 at 9:05 AM, David Howells <[email protected]> wrote:
>
> Implement a pair of new system calls to provide extended and further extensible
stat functions.
<snip>
> Should the default for a network fs be to do an unconditional (heavyweight)
> stat with a flag to suppress going to the server to update the locally held
> attributes and flushing pending writebacks?

Even though we can use leases (oplocks) to avoid the roundrtrip, it is
probably too expensive to default to forcing a cache flush, especially
when a common case is to get the file creation time or inode number
information (stable vs volatile).

Would it be better to make the stable vs volatile inode number an attribute
of the volume or something returned by the proposed xstat?

> Should things like the Windows Archive, Hidden and System bits be handled
> through IOC flags, perhaps expanded to 64-bits?

Today I export these through an psuedo-xattr in cifs.ko, I am curious how
NTFS and FAT export these on linux.

> ==========
> TO BE DONE
> ==========
>
> Autofs, ntfs, btrfs, ...

Given the overlap in optional attributes between the network
protocol and local NTFS (and ReFS and to a lesser extent FAT)
I would expect cifs.ko and the ntfs implementations
info to map pretty closely.

> I should perhaps use u8/u32/u64 rather than uint8/32/64_t.
>
> Handle remote filesystems being offline and indicate this with
> XSTAT_INFO_OFFLINE.

You already have support for an indicator for offline files (HSM),
would XSTAT_INFO_OFFLINE be intended for the case
where the network session to the server is disconnected
(and in which you case the application does not want to reconnect)?

--
Thanks,

Steve

2012-04-19 21:51:18

by Paul Eggert

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On 04/19/2012 09:32 AM, Roland McGrath wrote:
> I have no comment on the functionality. But "xstat" is probably a poor
> choice of name.

In AIX 7.1 the (similar) function is called statxat instead of xstat.
The API isn't exactly the same, but it's the same basic idea.
Might be worth looking at, not merely to see whether the API
should be the same, but also to borrow good ideas even if not.

http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.basetechref/doc/basetrf2/statx.htm

2012-04-19 23:06:02

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

statx seems like a better family of names. I also think it's worthwhile to
see if the interface can be made to more closely match the AIX precedent.


Thanks,
Roland

2012-04-19 23:29:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On 2012-04-19, at 8:05 AM, David Howells wrote:
> Implement a pair of new system calls to provide extended and further
> extensible stat functions.

Hallelujah for this. I've been waiting/wanting something like this
for ages already. Now if only we can get this landed before it
degrades into the mess it did last time.

> Should fxstat() be implemented as xstat() with a NULL filename,
> using dfd as fd?

I'm personally inclined toward fewer syscalls, especially since
the fstatxat()->statxat() mapping (if I can be so bold as to
prefer the names used later in this thread) is IMHO clear and
unambiguous, and avoids several thin wrappers in the kernel.

> Should the default for a network fs be to do an unconditional
> (heavyweight) stat with a flag to suppress going to the server
> to update the locally held attributes and flushing pending writebacks?

NOOOooo! If application writers are going to use this, they should
request the information needed, and no more. Make no assumptions
about what information is easy or hard for a filesystem to return,
since the overhead can vary wildly depending on the implementation.

Something like "ls --color" (no -l or -s) always stats the file just
to get the mode bits to color executable files differently. Having
to return other information that isn't totally free almost ruins the
benefit of adding a new syscall in the first place.

> Should things like the Windows Archive, Hidden and System bits be
> handled through IOC flags, perhaps expanded to 64-bits?

I'm definitely in favour of a 64-bit IOC flags value, since they are
getting close to running out already. As to whether those other bits
should be merged into the IOC flags, I'm mostly indifferent, but I
lean toward including them since they are definitely related.

I wouldn't object to 64-bit UID/GID values or split 32-bit low/hi UID
and GID values, since NFSv4 and Kerberos realms will likely need this
at some point as well. That said, if the API is extensible, it would
be just as easy to add the low/hi split values when they are needed
in the future.

> Are these things useful to userspace other than Samba and userspace
> NFS servers?

Definitely yes. The GNU fileutils can use a lot of this, since they
are VERY stat() heavy for things like checking st_dev and st_ino
changes during directory traversal, but don't need any of the other
info.

> Is it useful to pass the volume ID out? Or is statfs() sufficient
> for this?

Can't hurt, IMHO. It is a better (more persistent) identifier than
st_dev, and if it is free, or explicitly requested by the application
(Samba, Ganesha, etc) it can be very useful.

Cheers, Andreas






2012-04-26 13:54:15

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Roland McGrath <[email protected]> wrote:

> I have no comment on the functionality. But "xstat" is probably a poor
> choice of name. There is precedent for that function name with different
> meaning in the userland APIs. (It's a moderately useless meaning inherited
> from SVR4, but regardless overloading a name previously used is unwise.)

I've no particular attachment to the name 'xstat'. If you'd prefer 'statx' I
could go for that.

David

2012-04-26 14:04:07

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Paul Eggert <[email protected]> wrote:

> On 04/19/2012 09:32 AM, Roland McGrath wrote:
> > I have no comment on the functionality. But "xstat" is probably a poor
> > choice of name.
>
> In AIX 7.1 the (similar) function is called statxat instead of xstat.
> The API isn't exactly the same, but it's the same basic idea.
> Might be worth looking at, not merely to see whether the API
> should be the same, but also to borrow good ideas even if not.
>
> http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.basetechref/doc/basetrf2/statx.htm

Interesting. I wasn't intending to provide both statx() and statxat()
variants, just the latter, in which case I'd've though that -at suffix is
redundant.

I note that they split their time fields into separate seconds and ns fields,
presumably for better packing.

David

2012-04-26 14:16:00

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Roland McGrath <[email protected]> wrote:

> statx seems like a better family of names. I also think it's worthwhile to
> see if the interface can be made to more closely match the AIX precedent.

I'm not sure we can make Linux xstat (or whatever) match AIX statxat() very
closely, at least from a syscall interface point of view. We really need the
AT_* flags mask to be compatible with the other Linux syscalls and the length
parameter that I originally had got argued out of existence. I would also
like to make it so that there aren't different 32-bit and 64-bit interfaces to
the kernel - though that means a burden on glibc/uClibc/etc., I guess.

I do also want to be able to pass a mask of what we're actually interested in
- but that's not part of the AIX interface.

I also like the idea of having a larger buffer with the ability to return
extra info (such as security label xattrs), but I was told it was a poor idea,
so that got taken out.

David

2012-04-26 14:25:58

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Steve French <[email protected]> wrote:

> Would it be better to make the stable vs volatile inode number an attribute
> of the volume or something returned by the proposed xstat?

I'm not sure what you mean by a stable vs a volatile inode number.

> > Should things like the Windows Archive, Hidden and System bits be handled
> > through IOC flags, perhaps expanded to 64-bits?
>
> Today I export these through an psuedo-xattr in cifs.ko, I am curious how
> NTFS and FAT export these on linux.

NTFS: Not at all.

FAT: The hidden bit causes the filename to get a dot prepended (and nothing
else is noted).

> > Autofs, ntfs, btrfs, ...
>
> Given the overlap in optional attributes between the network
> protocol and local NTFS (and ReFS and to a lesser extent FAT)
> I would expect cifs.ko and the ntfs implementations
> info to map pretty closely.

Yep. I wasn't going to do more filesystems till we'd finished arguing about
the basic arrangement of things in struct xstat.

> > Handle remote filesystems being offline and indicate this with
> > XSTAT_INFO_OFFLINE.
>
> You already have support for an indicator for offline files (HSM),

HSM?

> would XSTAT_INFO_OFFLINE be intended for the case
> where the network session to the server is disconnected
> (and in which you case the application does not want to reconnect)?

Hmmm... Interesting question. Both NTFS and CIFS have an offline attribute
(which is where I originally got this from) - but should I have a separate
indicator to indicate the client can't access a server over a network
(ie. we've gone to disconnected operation on this file)? E.g. should there be
a XSTAT_INFO_DISCONNECTED too?

David

2012-04-26 13:52:40

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 4/6] xstat: NFS: Return extended attributes

Myklebust, Trond <[email protected]> wrote:

> Hmm... As far as I can see you are still doing an nfs_revalidate_inode()
> in the non-forced case. That will cause expired attributes to be
> retrieved from the server.

Revalidation is only done when you force it or explicitly ask for a basic stat
or the data version number:

- if (need_atime)
- err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
- else
- err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
- if (!err) {
- generic_fillattr(inode, stat);
- stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
+ if (force || stat->request_mask & (XSTAT_BASIC_STATS | XSTAT_VERSION)) {
+ if (force || need_atime)
+ err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ else
+ err = nfs_revalidate_inode(NFS_SERVER(inode), inode);
+ if (err)
+ goto out;

Unfortunately, I think I have to revalidate if any of XSTAT_BASIC_STATS are
requested to maintain compatibility with stat() so that stat() can be done
with xstat(). On the other hand, stat() could be done by userspace with
xstat() and AT_FORCE_ATTR_SYNC, I suppose.

David

2012-04-26 14:54:26

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
> Steve French <[email protected]> wrote:
>
>> Would it be better to make the stable vs volatile inode number an attribute
>> of the volume ?or something returned by the proposed xstat?
>
> I'm not sure what you mean by a stable vs a volatile inode number.

Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
unique identifier, but in the case of CIFS some old servers don't support the
calls which return inode numbers (or don't return them for all file system
types, Windows FAT?) so in these cases cifs has to create inode
numbers on the fly
on the client. inode numbers created on the client are not "stable" they
can change on unmount/remount (which can cause problems for backup
applications).

Similarly NFSv4 does not require that servers always return stable inode numbers
(that will never change) and introduced a concept of "volatile file handle."
We have run into this in two cases (there are probably more) -
Specialized NFS servers
for HPC which deal with lots of transient inodes, and second those for servers
which base there inode number on path (Windows NFS?). See
http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
or the NFSv4 RFC.

Basically the question is whether it is worth reporting a flag on the
call which returns
the inode number to indicate that the inode number is "stable" (would not change
on reboot or reconnection) or "volatile." Since the majority of NFS
and SMB2 servers
can return stable inode numbers, I don't feel strongly about the need
for an indicator
of "stable" vs. "volatile" but I mention it because backup and
migration applications
mention this (if inode numbers are volatile, they may have to check
for hardlinks differently
for example)


>> > Should things like the Windows Archive, Hidden and System bits be handled
>> > through IOC flags, perhaps expanded to 64-bits?
>>
>> Today I export these through an psuedo-xattr in cifs.ko, I am curious how
>> NTFS and FAT export these on linux.
>
> NTFS: Not at all.
>
> FAT: The hidden bit causes the filename to get a dot prepended (and nothing
> else is noted).
>
>> > Autofs, ntfs, btrfs, ...
>>
>> Given the overlap in optional attributes between the network
>> protocol and local NTFS (and ReFS and to a lesser extent FAT)
>> I would expect cifs.ko and the ntfs implementations
>> info to map pretty closely.
>
> Yep. ?I wasn't going to do more filesystems till we'd finished arguing about
> the basic arrangement of things in struct xstat.

makes sense


>> > Handle remote filesystems being offline and indicate this with
>> > XSTAT_INFO_OFFLINE.
>>
>> You already have support for an indicator for offline files (HSM),
>
> HSM

HSM is the more general case of two tiered data (disk vs. tape)

en.wikipedia.org/wiki/Hierarchical_storage_management

in the simplest case on "disk" (fast) vs. moved to tape (slow to retrieve)


>> would XSTAT_INFO_OFFLINE be intended for the case
>> where the network session to the server is disconnected
>> (and in which you case the application does not want to reconnect)?
>
> Hmmm... ?Interesting question. ?Both NTFS and CIFS have an offline attribute
> (which is where I originally got this from) - but should I have a separate
> indicator to indicate the client can't access a server over a network
> (ie. we've gone to disconnected operation on this file)? ?E.g. should there be
> a XSTAT_INFO_DISCONNECTED too?

my reaction is no, since it adds complexity. If you do a stat on a disconnected
volume (where the network is temporarily down) reconnection will be attempted.
If reconnection fails then the xstat will either fail or be retried forever
depending on the value of "hard" vs. "soft" mount flag.

--
Thanks,

Steve

2012-04-26 15:25:59

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, 2012-04-26 at 09:54 -0500, Steve French wrote:
> On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
> > Steve French <[email protected]> wrote:
> >
> >> Would it be better to make the stable vs volatile inode number an attribute
> >> of the volume or something returned by the proposed xstat?
> >
> > I'm not sure what you mean by a stable vs a volatile inode number.
>
> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
> unique identifier, but in the case of CIFS some old servers don't support the
> calls which return inode numbers (or don't return them for all file system
> types, Windows FAT?) so in these cases cifs has to create inode
> numbers on the fly
> on the client. inode numbers created on the client are not "stable" they
> can change on unmount/remount (which can cause problems for backup
> applications).
>
> Similarly NFSv4 does not require that servers always return stable inode numbers
> (that will never change) and introduced a concept of "volatile file handle."
> We have run into this in two cases (there are probably more) -
> Specialized NFS servers
> for HPC which deal with lots of transient inodes, and second those for servers
> which base there inode number on path (Windows NFS?). See
> http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
> or the NFSv4 RFC.
>
> Basically the question is whether it is worth reporting a flag on the
> call which returns
> the inode number to indicate that the inode number is "stable" (would not change
> on reboot or reconnection) or "volatile." Since the majority of NFS
> and SMB2 servers
> can return stable inode numbers, I don't feel strongly about the need
> for an indicator
> of "stable" vs. "volatile" but I mention it because backup and
> migration applications
> mention this (if inode numbers are volatile, they may have to check
> for hardlinks differently
> for example)

I don't understand. If the filesystem doesn't support real inode
numbers, then why report them at all? What use would an application have
for an inode number that can't be used to identify hard linked files?

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-04-26 15:52:15

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Steve French <[email protected]> wrote:

> >> Would it be better to make the stable vs volatile inode number an attribute
> >> of the volume or something returned by the proposed xstat?
> >
> > I'm not sure what you mean by a stable vs a volatile inode number.
>
> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent unique
> identifier, but in the case of CIFS some old servers don't support the calls
> which return inode numbers (or don't return them for all file system types,
> Windows FAT?) so in these cases cifs has to create inode numbers on the fly
> on the client. inode numbers created on the client are not "stable" they can
> change on unmount/remount (which can cause problems for backup applications).

In the volatile case you'd probably want to unset XSTAT_INO in st_mask as the
inode number is a local fabrication. However, since there is a remote file ID,
we could add an XSTAT_INFO_FILE_ID flag to indicate there's a standard xattr
holding this. On CIFS this could be the servername + pathname, on NFS this
could be the server address + FH on AFS the cell+volID+FID+uniquifier for
example. That's independent of xstat, however, and wouldn't be returned as
it's a blob that could be quite large.

I presume in some cases, there is not a unique file ID that persists across
rename.

> Similarly NFSv4 does not require that servers always return stable inode
> numbers (that will never change) and introduced a concept of "volatile file
> handle."

Can I presume the inode number cannot be considered stable if the NFS4 FH is
non-volatile? Furthermore, can I presume NFS2/3 inode numbers are supposed to
be stable?

> Basically the question is whether it is worth reporting a flag on the call
> which returns the inode number to indicate that the inode number is "stable"
> (would not change on reboot or reconnection) or "volatile." Since the
> majority of NFS and SMB2 servers can return stable inode numbers, I don't
> feel strongly about the need for an indicator of "stable" vs. "volatile" but
> I mention it because backup and migration applications mention this (if inode
> numbers are volatile, they may have to check for hardlinks differently for
> example)

It may be that unsetting XSTAT_INO if you've fabricated the inode number
locally is sufficient.

> >> > Handle remote filesystems being offline and indicate this with
> >> > XSTAT_INFO_OFFLINE.
> >>
> >> You already have support for an indicator for offline files (HSM),

Which indicator is this? Or do you mean XSTAT_INFO_OFFLINE?

> >> would XSTAT_INFO_OFFLINE be intended for the case
> >> where the network session to the server is disconnected
> >> (and in which you case the application does not want to reconnect)?
> >
> > Hmmm... Interesting question. Both NTFS and CIFS have an offline
> > attribute (which is where I originally got this from) - but should I have a
> > separate indicator to indicate the client can't access a server over a
> > network (ie. we've gone to disconnected operation on this file)?
> > E.g. should there be a XSTAT_INFO_DISCONNECTED too?
>
> my reaction is no, since it adds complexity. If you do a stat on a
> disconnected volume (where the network is temporarily down) reconnection will
> be attempted. If reconnection fails then the xstat will either fail or be
> retried forever depending on the value of "hard" vs. "soft" mount flag.

I was thinking of how to handle disconnected operation, where you can't just
sit there and churn waiting for the server to come back or give an error. On
the other hand, as long as there's some spare space in the struct, we can deal
with that later when we actually start to implement D/O.

David

2012-04-26 16:56:17

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 26, 2012 at 10:25 AM, Myklebust, Trond
<[email protected]> wrote:
> On Thu, 2012-04-26 at 09:54 -0500, Steve French wrote:
>> On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
>> > Steve French <[email protected]> wrote:
>> >
>> >> Would it be better to make the stable vs volatile inode number an attribute
>> >> of the volume ?or something returned by the proposed xstat?
>> >
>> > I'm not sure what you mean by a stable vs a volatile inode number.
>>
>> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
>> unique identifier, but in the case of CIFS some old servers don't support the
>> calls which return inode numbers (or don't return them for all file system
>> types, Windows FAT?) so in these cases cifs has to create inode
>> numbers on the fly
>> on the client. ? inode numbers created on the client are not "stable" they
>> can change on unmount/remount (which can cause problems for backup
>> applications).
>>
>> Similarly NFSv4 does not require that servers always return stable inode numbers
>> (that will never change) and introduced a concept of "volatile file handle."
>> We have run into this in two cases (there are probably more) -
>> Specialized NFS servers
>> for HPC which deal with lots of transient inodes, and second those for servers
>> which base there inode number on path (Windows NFS?). ?See
>> http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
>> or the NFSv4 RFC.
>>
>> Basically the question is whether it is worth reporting a flag on the
>> call which returns
>> the inode number to indicate that the inode number is "stable" (would not change
>> on reboot or reconnection) or "volatile." ? ?Since the majority of NFS
>> and SMB2 servers
>> can return stable inode numbers, I don't feel strongly about the need
>> for an indicator
>> of "stable" vs. "volatile" but I mention it because backup and
>> migration applications
>> mention this (if inode numbers are volatile, they may have to check
>> for hardlinks differently
>> for example)
>
> I don't understand. If the filesystem doesn't support real inode
> numbers, then why report them at all? What use would an application have
> for an inode number that can't be used to identify hard linked files?

Well ... you have to have an inode number on the Linux client side even if
the server doesn't report them (or has a bug and reports duplicates).
If you can't tell hardlinked files apart fix the server (but in the
cases where the file systems has this problem the server doesn't usually
support hardlinks either).

If the server's file system internal structures don't support real inode
numbers (such as FAT or a ramdisk) then it either has to make them
up based on something like path name or some other attribute of the
file on disk.

Servers like NetApp is where this gets interesting - for cifs e.g. level 1009
query file info is used to query_file_internal_info (the inode number) but
what if the server can not report inode numbers (due to a bug) in
all cases.

--
Thanks,

Steve

2012-04-26 17:00:41

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, 2012-04-26 at 11:56 -0500, Steve French wrote:
> On Thu, Apr 26, 2012 at 10:25 AM, Myklebust, Trond
> <[email protected]> wrote:
> > On Thu, 2012-04-26 at 09:54 -0500, Steve French wrote:
> >> On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
> >> > Steve French <[email protected]> wrote:
> >> >
> >> >> Would it be better to make the stable vs volatile inode number an attribute
> >> >> of the volume or something returned by the proposed xstat?
> >> >
> >> > I'm not sure what you mean by a stable vs a volatile inode number.
> >>
> >> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
> >> unique identifier, but in the case of CIFS some old servers don't support the
> >> calls which return inode numbers (or don't return them for all file system
> >> types, Windows FAT?) so in these cases cifs has to create inode
> >> numbers on the fly
> >> on the client. inode numbers created on the client are not "stable" they
> >> can change on unmount/remount (which can cause problems for backup
> >> applications).
> >>
> >> Similarly NFSv4 does not require that servers always return stable inode numbers
> >> (that will never change) and introduced a concept of "volatile file handle."
> >> We have run into this in two cases (there are probably more) -
> >> Specialized NFS servers
> >> for HPC which deal with lots of transient inodes, and second those for servers
> >> which base there inode number on path (Windows NFS?). See
> >> http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
> >> or the NFSv4 RFC.
> >>
> >> Basically the question is whether it is worth reporting a flag on the
> >> call which returns
> >> the inode number to indicate that the inode number is "stable" (would not change
> >> on reboot or reconnection) or "volatile." Since the majority of NFS
> >> and SMB2 servers
> >> can return stable inode numbers, I don't feel strongly about the need
> >> for an indicator
> >> of "stable" vs. "volatile" but I mention it because backup and
> >> migration applications
> >> mention this (if inode numbers are volatile, they may have to check
> >> for hardlinks differently
> >> for example)
> >
> > I don't understand. If the filesystem doesn't support real inode
> > numbers, then why report them at all? What use would an application have
> > for an inode number that can't be used to identify hard linked files?
>
> Well ... you have to have an inode number on the Linux client side even if
> the server doesn't report them (or has a bug and reports duplicates).
> If you can't tell hardlinked files apart fix the server (but in the
> cases where the file systems has this problem the server doesn't usually
> support hardlinks either).
>
> If the server's file system internal structures don't support real inode
> numbers (such as FAT or a ramdisk) then it either has to make them
> up based on something like path name or some other attribute of the
> file on disk.
>
> Servers like NetApp is where this gets interesting - for cifs e.g. level 1009
> query file info is used to query_file_internal_info (the inode number) but
> what if the server can not report inode numbers (due to a bug) in
> all cases.

Right, but none of this explains why we need to report these bogus inode
numbers to the application in the xstat() reply.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

????{.n?+???????+%?????ݶ??w??{.n?+????{??w??)????w*jg????????ݢj???n?r??6;靫3??\鹹bs?&jw??? ???j:+v???w?j?m????????w?????f???h???????٥

2012-04-26 17:03:07

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 26, 2012 at 12:00 PM, Myklebust, Trond
<[email protected]> wrote:
> On Thu, 2012-04-26 at 11:56 -0500, Steve French wrote:
>> On Thu, Apr 26, 2012 at 10:25 AM, Myklebust, Trond
>> <[email protected]> wrote:
>> > On Thu, 2012-04-26 at 09:54 -0500, Steve French wrote:
>> >> On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
>> >> > Steve French <[email protected]> wrote:
>> >> >
>> >> >> Would it be better to make the stable vs volatile inode number an attribute
>> >> >> of the volume ?or something returned by the proposed xstat?
>> >> >
>> >> > I'm not sure what you mean by a stable vs a volatile inode number.
>> >>
>> >> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
>> >> unique identifier, but in the case of CIFS some old servers don't support the
>> >> calls which return inode numbers (or don't return them for all file system
>> >> types, Windows FAT?) so in these cases cifs has to create inode
>> >> numbers on the fly
>> >> on the client. ? inode numbers created on the client are not "stable" they
>> >> can change on unmount/remount (which can cause problems for backup
>> >> applications).
>> >>
>> >> Similarly NFSv4 does not require that servers always return stable inode numbers
>> >> (that will never change) and introduced a concept of "volatile file handle."
>> >> We have run into this in two cases (there are probably more) -
>> >> Specialized NFS servers
>> >> for HPC which deal with lots of transient inodes, and second those for servers
>> >> which base there inode number on path (Windows NFS?). ?See
>> >> http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
>> >> or the NFSv4 RFC.
>> >>
>> >> Basically the question is whether it is worth reporting a flag on the
>> >> call which returns
>> >> the inode number to indicate that the inode number is "stable" (would not change
>> >> on reboot or reconnection) or "volatile." ? ?Since the majority of NFS
>> >> and SMB2 servers
>> >> can return stable inode numbers, I don't feel strongly about the need
>> >> for an indicator
>> >> of "stable" vs. "volatile" but I mention it because backup and
>> >> migration applications
>> >> mention this (if inode numbers are volatile, they may have to check
>> >> for hardlinks differently
>> >> for example)
>> >
>> > I don't understand. If the filesystem doesn't support real inode
>> > numbers, then why report them at all? What use would an application have
>> > for an inode number that can't be used to identify hard linked files?
>>
>> Well ... you have to have an inode number on the Linux client side even if
>> the server doesn't report them (or has a bug and reports duplicates).
>> If you can't tell hardlinked files apart fix the server (but in the
>> cases where the file systems has this problem the server doesn't usually
>> support hardlinks either).
>>
>> If the server's file system internal structures don't support real inode
>> numbers (such as FAT or a ramdisk) then it either has to make them
>> up based on something like path name or some other attribute of the
>> file on disk.
>>
>> Servers like NetApp is where this gets interesting - for cifs e.g. level 1009
>> query file info is used to query_file_internal_info (the inode number) but
>> what if the server can not report inode numbers (due to a bug) in
>> all cases.
>
> Right, but none of this explains why we need to report these bogus inode
> numbers to the application in the xstat() reply.

the question is whether the application (backup) would need to know
that the inode numbers are bogus and from my conversations with
guys writing backup software it seems that such data is useful to them.


--
Thanks,

Steve

2012-04-26 17:06:51

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, 2012-04-26 at 12:03 -0500, Steve French wrote:
> On Thu, Apr 26, 2012 at 12:00 PM, Myklebust, Trond
> <[email protected]> wrote:
> > On Thu, 2012-04-26 at 11:56 -0500, Steve French wrote:
> >> On Thu, Apr 26, 2012 at 10:25 AM, Myklebust, Trond
> >> <[email protected]> wrote:
> >> > On Thu, 2012-04-26 at 09:54 -0500, Steve French wrote:
> >> >> On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
> >> >> > Steve French <[email protected]> wrote:
> >> >> >
> >> >> >> Would it be better to make the stable vs volatile inode number an attribute
> >> >> >> of the volume or something returned by the proposed xstat?
> >> >> >
> >> >> > I'm not sure what you mean by a stable vs a volatile inode number.
> >> >>
> >> >> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
> >> >> unique identifier, but in the case of CIFS some old servers don't support the
> >> >> calls which return inode numbers (or don't return them for all file system
> >> >> types, Windows FAT?) so in these cases cifs has to create inode
> >> >> numbers on the fly
> >> >> on the client. inode numbers created on the client are not "stable" they
> >> >> can change on unmount/remount (which can cause problems for backup
> >> >> applications).
> >> >>
> >> >> Similarly NFSv4 does not require that servers always return stable inode numbers
> >> >> (that will never change) and introduced a concept of "volatile file handle."
> >> >> We have run into this in two cases (there are probably more) -
> >> >> Specialized NFS servers
> >> >> for HPC which deal with lots of transient inodes, and second those for servers
> >> >> which base there inode number on path (Windows NFS?). See
> >> >> http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
> >> >> or the NFSv4 RFC.
> >> >>
> >> >> Basically the question is whether it is worth reporting a flag on the
> >> >> call which returns
> >> >> the inode number to indicate that the inode number is "stable" (would not change
> >> >> on reboot or reconnection) or "volatile." Since the majority of NFS
> >> >> and SMB2 servers
> >> >> can return stable inode numbers, I don't feel strongly about the need
> >> >> for an indicator
> >> >> of "stable" vs. "volatile" but I mention it because backup and
> >> >> migration applications
> >> >> mention this (if inode numbers are volatile, they may have to check
> >> >> for hardlinks differently
> >> >> for example)
> >> >
> >> > I don't understand. If the filesystem doesn't support real inode
> >> > numbers, then why report them at all? What use would an application have
> >> > for an inode number that can't be used to identify hard linked files?
> >>
> >> Well ... you have to have an inode number on the Linux client side even if
> >> the server doesn't report them (or has a bug and reports duplicates).
> >> If you can't tell hardlinked files apart fix the server (but in the
> >> cases where the file systems has this problem the server doesn't usually
> >> support hardlinks either).
> >>
> >> If the server's file system internal structures don't support real inode
> >> numbers (such as FAT or a ramdisk) then it either has to make them
> >> up based on something like path name or some other attribute of the
> >> file on disk.
> >>
> >> Servers like NetApp is where this gets interesting - for cifs e.g. level 1009
> >> query file info is used to query_file_internal_info (the inode number) but
> >> what if the server can not report inode numbers (due to a bug) in
> >> all cases.
> >
> > Right, but none of this explains why we need to report these bogus inode
> > numbers to the application in the xstat() reply.
>
> the question is whether the application (backup) would need to know
> that the inode numbers are bogus and from my conversations with
> guys writing backup software it seems that such data is useful to them.

You are still not explaining why they need to know the values at all? If
the values are bogus, then don't return them, and don't set the flag
that says they are being returned.

Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-04-26 17:09:38

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 26, 2012 at 12:06 PM, Myklebust, Trond
<[email protected]> wrote:
> On Thu, 2012-04-26 at 12:03 -0500, Steve French wrote:
>> On Thu, Apr 26, 2012 at 12:00 PM, Myklebust, Trond
>> <[email protected]> wrote:
>> > On Thu, 2012-04-26 at 11:56 -0500, Steve French wrote:
>> >> On Thu, Apr 26, 2012 at 10:25 AM, Myklebust, Trond
>> >> <[email protected]> wrote:
>> >> > On Thu, 2012-04-26 at 09:54 -0500, Steve French wrote:
>> >> >> On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
>> >> >> > Steve French <[email protected]> wrote:
>> >> >> >
>> >> >> >> Would it be better to make the stable vs volatile inode number an attribute
>> >> >> >> of the volume ?or something returned by the proposed xstat?
>> >> >> >
>> >> >> > I'm not sure what you mean by a stable vs a volatile inode number.
>> >> >>
>> >> >> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
>> >> >> unique identifier, but in the case of CIFS some old servers don't support the
>> >> >> calls which return inode numbers (or don't return them for all file system
>> >> >> types, Windows FAT?) so in these cases cifs has to create inode
>> >> >> numbers on the fly
>> >> >> on the client. ? inode numbers created on the client are not "stable" they
>> >> >> can change on unmount/remount (which can cause problems for backup
>> >> >> applications).
>> >> >>
>> >> >> Similarly NFSv4 does not require that servers always return stable inode numbers
>> >> >> (that will never change) and introduced a concept of "volatile file handle."
>> >> >> We have run into this in two cases (there are probably more) -
>> >> >> Specialized NFS servers
>> >> >> for HPC which deal with lots of transient inodes, and second those for servers
>> >> >> which base there inode number on path (Windows NFS?). ?See
>> >> >> http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
>> >> >> or the NFSv4 RFC.
>> >> >>
>> >> >> Basically the question is whether it is worth reporting a flag on the
>> >> >> call which returns
>> >> >> the inode number to indicate that the inode number is "stable" (would not change
>> >> >> on reboot or reconnection) or "volatile." ? ?Since the majority of NFS
>> >> >> and SMB2 servers
>> >> >> can return stable inode numbers, I don't feel strongly about the need
>> >> >> for an indicator
>> >> >> of "stable" vs. "volatile" but I mention it because backup and
>> >> >> migration applications
>> >> >> mention this (if inode numbers are volatile, they may have to check
>> >> >> for hardlinks differently
>> >> >> for example)
>> >> >
>> >> > I don't understand. If the filesystem doesn't support real inode
>> >> > numbers, then why report them at all? What use would an application have
>> >> > for an inode number that can't be used to identify hard linked files?
>> >>
>> >> Well ... you have to have an inode number on the Linux client side even if
>> >> the server doesn't report them (or has a bug and reports duplicates).
>> >> If you can't tell hardlinked files apart fix the server (but in the
>> >> cases where the file systems has this problem the server doesn't usually
>> >> support hardlinks either).
>> >>
>> >> If the server's file system internal structures don't support real inode
>> >> numbers (such as FAT or a ramdisk) then it either has to make them
>> >> up based on something like path name or some other attribute of the
>> >> file on disk.
>> >>
>> >> Servers like NetApp is where this gets interesting - for cifs e.g. level 1009
>> >> query file info is used to query_file_internal_info (the inode number) but
>> >> what if the server can not report inode numbers (due to a bug) in
>> >> all cases.
>> >
>> > Right, but none of this explains why we need to report these bogus inode
>> > numbers to the application in the xstat() reply.
>>
>> the question is whether the application (backup) would need to know
>> that the inode numbers are bogus and from my conversations with
>> guys writing backup software it seems that such data is useful to them.
>
> You are still not explaining why they need to know the values at all? If
> the values are bogus, then don't return them, and don't set the flag
> that says they are being returned.

I don't know, but assumed it was because it was an easy way
to index them since the inode numbers even if they "change"
on remount, are still unique.



--
Thanks,

Steve

2012-04-26 17:10:50

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 26, 2012 at 12:09 PM, Steve French <[email protected]> wrote:
> On Thu, Apr 26, 2012 at 12:06 PM, Myklebust, Trond
> <[email protected]> wrote:
>> On Thu, 2012-04-26 at 12:03 -0500, Steve French wrote:
>>> On Thu, Apr 26, 2012 at 12:00 PM, Myklebust, Trond
>>> <[email protected]> wrote:
>>> > On Thu, 2012-04-26 at 11:56 -0500, Steve French wrote:
>>> >> On Thu, Apr 26, 2012 at 10:25 AM, Myklebust, Trond
>>> >> <[email protected]> wrote:
>>> >> > On Thu, 2012-04-26 at 09:54 -0500, Steve French wrote:
>>> >> >> On Thu, Apr 26, 2012 at 9:25 AM, David Howells <[email protected]> wrote:
>>> >> >> > Steve French <[email protected]> wrote:
>>> >> >> >
>>> >> >> >> Would it be better to make the stable vs volatile inode number an attribute
>>> >> >> >> of the volume ?or something returned by the proposed xstat?
>>> >> >> >
>>> >> >> > I'm not sure what you mean by a stable vs a volatile inode number.
>>> >> >>
>>> >> >> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent
>>> >> >> unique identifier, but in the case of CIFS some old servers don't support the
>>> >> >> calls which return inode numbers (or don't return them for all file system
>>> >> >> types, Windows FAT?) so in these cases cifs has to create inode
>>> >> >> numbers on the fly
>>> >> >> on the client. ? inode numbers created on the client are not "stable" they
>>> >> >> can change on unmount/remount (which can cause problems for backup
>>> >> >> applications).
>>> >> >>
>>> >> >> Similarly NFSv4 does not require that servers always return stable inode numbers
>>> >> >> (that will never change) and introduced a concept of "volatile file handle."
>>> >> >> We have run into this in two cases (there are probably more) -
>>> >> >> Specialized NFS servers
>>> >> >> for HPC which deal with lots of transient inodes, and second those for servers
>>> >> >> which base there inode number on path (Windows NFS?). ?See
>>> >> >> http://docs.oracle.com/cd/E19082-01/819-1634/rfsrefer-137/index.html
>>> >> >> or the NFSv4 RFC.
>>> >> >>
>>> >> >> Basically the question is whether it is worth reporting a flag on the
>>> >> >> call which returns
>>> >> >> the inode number to indicate that the inode number is "stable" (would not change
>>> >> >> on reboot or reconnection) or "volatile." ? ?Since the majority of NFS
>>> >> >> and SMB2 servers
>>> >> >> can return stable inode numbers, I don't feel strongly about the need
>>> >> >> for an indicator
>>> >> >> of "stable" vs. "volatile" but I mention it because backup and
>>> >> >> migration applications
>>> >> >> mention this (if inode numbers are volatile, they may have to check
>>> >> >> for hardlinks differently
>>> >> >> for example)
>>> >> >
>>> >> > I don't understand. If the filesystem doesn't support real inode
>>> >> > numbers, then why report them at all? What use would an application have
>>> >> > for an inode number that can't be used to identify hard linked files?
>>> >>
>>> >> Well ... you have to have an inode number on the Linux client side even if
>>> >> the server doesn't report them (or has a bug and reports duplicates).
>>> >> If you can't tell hardlinked files apart fix the server (but in the
>>> >> cases where the file systems has this problem the server doesn't usually
>>> >> support hardlinks either).
>>> >>
>>> >> If the server's file system internal structures don't support real inode
>>> >> numbers (such as FAT or a ramdisk) then it either has to make them
>>> >> up based on something like path name or some other attribute of the
>>> >> file on disk.
>>> >>
>>> >> Servers like NetApp is where this gets interesting - for cifs e.g. level 1009
>>> >> query file info is used to query_file_internal_info (the inode number) but
>>> >> what if the server can not report inode numbers (due to a bug) in
>>> >> all cases.
>>> >
>>> > Right, but none of this explains why we need to report these bogus inode
>>> > numbers to the application in the xstat() reply.
>>>
>>> the question is whether the application (backup) would need to know
>>> that the inode numbers are bogus and from my conversations with
>>> guys writing backup software it seems that such data is useful to them.
>>
>> You are still not explaining why they need to know the values at all? If
>> the values are bogus, then don't return them, and don't set the flag
>> that says they are being returned.
>
> I don't know, but assumed it was because it was an easy way
> to index them since the inode numbers even if they "change"
> on remount, are still unique.

if the call allows the inode number not to be returned, that
is probably ok (they can always use the posix stat to get
the client generated one)



--
Thanks,

Steve

2012-04-26 18:22:02

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

> Roland McGrath <roland-/[email protected]> wrote:
>
> > statx seems like a better family of names. I also think it's worthwhile to
> > see if the interface can be made to more closely match the AIX precedent.
>
> I'm not sure we can make Linux xstat (or whatever) match AIX statxat() very
> closely, at least from a syscall interface point of view.

OK. It was just worth a look.


Thanks,
Roland

2012-04-26 18:24:01

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

> Interesting. I wasn't intending to provide both statx() and statxat()
> variants, just the latter, in which case I'd've though that -at suffix is
> redundant.

It's certainly fine to provide only *at flavors for any new syscall, IMHO.
The * case is always just a simple degenerate case of *at, and libc can
trivially provide the simpler user API as well using the *at syscall.

But please keep the uniformity that everything taking a descriptor and AT_*
flags is named *at.


Thanks,
Roland

2012-04-26 18:25:24

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

> I've no particular attachment to the name 'xstat'. If you'd prefer 'statx' I
> could go for that.

I prefer something other than xstat and statx(at) seems acceptable enough.
What I'd really prefer is a name that is less meaninglessly arcane,
but I haven't thought of any good ones.


Thanks,
Roland

2012-04-26 21:54:41

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Roland McGrath <roland-/[email protected]> wrote:

> > I've no particular attachment to the name 'xstat'. If you'd prefer
> > 'statx' I could go for that.
>
> I prefer something other than xstat and statx(at) seems acceptable enough.
> What I'd really prefer is a name that is less meaninglessly arcane,
> but I haven't thought of any good ones.

fileinfoat() perhaps? I think stat*at() is better, though, as people are used
to the stat function family.

David

2012-04-26 21:57:56

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Myklebust, Trond <[email protected]> wrote:

> You are still not explaining why they need to know the values at all? If
> the values are bogus, then don't return them, and don't set the flag
> that says they are being returned.

What if the xstat() and struct xstat eventually becomes what userspace uses as
stat() (as a wrapper) and struct stat (if such a thing is possible with glibc
versioning)? Do older programs that think they're using stat() and don't know
about the extra fields available expect to see a useful value in st_ino?

David

2012-04-26 22:02:42

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

> fileinfoat() perhaps? I think stat*at() is better, though, as people are
> used to the stat function family.

Names like that were all I had thought of when I said I hadn't thought of
any good ones. ;-)

2012-04-26 22:05:52

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

> What if the xstat() and struct xstat eventually becomes what userspace
> uses as stat() (as a wrapper) and struct stat (if such a thing is
> possible with glibc versioning)?

It's certainly possible with symbol versioning, though it seems much more
likely that we'd stick with the existing struct stat and stat* interfaces
and only have the implementation using statx underneath (e.g. for new
machines or kernel ABIs where the kernel stops providing any calls except
for statxat), at least for the foreseeable future.

> Do older programs that think they're using stat() and don't know about
> the extra fields available expect to see a useful value in st_ino?

POSIX requires that st_ino have a useful value for the standard *stat calls.


Thanks,
Roland

2012-04-26 22:21:08

by Nix

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On 26 Apr 2012, Roland McGrath verbalised:

>> fileinfoat() perhaps? I think stat*at() is better, though, as people are
>> used to the stat function family.
>
> Names like that were all I had thought of when I said I hadn't thought of
> any good ones. ;-)

Quite. It's a garden-path function name. "What's a foat and why would I
want to put a file in one?"

--
NULL && (void)

2012-04-27 00:30:30

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, 2012-04-26 at 22:57 +0100, David Howells wrote:
> Myklebust, Trond <[email protected]> wrote:
>
> > You are still not explaining why they need to know the values at all? If
> > the values are bogus, then don't return them, and don't set the flag
> > that says they are being returned.
> th
> What if the xstat() and struct xstat eventually becomes what userspace uses as
> stat() (as a wrapper) and struct stat (if such a thing is possible with glibc
> versioning)? Do older programs that think they're using stat() and don't know
> about the extra fields available expect to see a useful value in st_ino?

Does it really matter whether it is the kernel or userland that is
responsible for faking up inode numbers? If userland wants to use
xstat() in order to fake up a stat() call, then it gets to take
responsibility for the results.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-04-27 00:33:06

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, 2012-04-26 at 15:05 -0700, Roland McGrath wrote:
> > What if the xstat() and struct xstat eventually becomes what userspace
> > uses as stat() (as a wrapper) and struct stat (if such a thing is
> > possible with glibc versioning)?
>
> It's certainly possible with symbol versioning, though it seems much more
> likely that we'd stick with the existing struct stat and stat* interfaces
> and only have the implementation using statx underneath (e.g. for new
> machines or kernel ABIs where the kernel stops providing any calls except
> for statxat), at least for the foreseeable future.
>
> > Do older programs that think they're using stat() and don't know about
> > the extra fields available expect to see a useful value in st_ino?
>
> POSIX requires that st_ino have a useful value for the standard *stat calls.

Yes, but we're talking about non-POSIX filesystems here. If the
filesystem doesn't have a useful value for st_ino, then the usual way of
dealing with those POSIX requirements is to fake up values. The question
then becomes whether or not we care if it is the kernel or userland that
fakes up those values.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-04-27 01:06:10

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 19, 2012 at 03:05:58PM +0100, David Howells wrote:
>
> Implement a pair of new system calls to provide extended and further extensible
> stat functions.
>
> The second of the associated patches is the main patch that provides these new
> system calls:
>
> ssize_t ret = xstat(int dfd,
> const char *filename,
> unsigned atflag,
> unsigned mask,
> struct xstat *buffer);
>
> ssize_t ret = fxstat(int fd,
> unsigned atflag,
> unsigned mask,
> struct xstat *buffer);
>
> which are more fully documented in the first patch's description.
>
> These new stat functions provide a number of useful features, in summary:
>
> (1) More information: creation time, inode generation number, data version
> number, flags/attributes. A subset of these is available through a number
> of filesystems (such as CIFS, NFS, AFS, Ext4 and BTRFS).

If we are adding per-inode flags, then what do we do with filesystem
specific flags? e.g. XFS has quite a number of per-inode flags that
don't align with any other filesystem (e.g. filestream allocator,
real time file, behaviour inheritence flags, etc), but may be useful
to retrieve in such a call. We currently have an ioctl to get that
information from each inode. Have you thought about how to handle
such flags?

Along the same lines, filesytsems can have different allocation
constraints to IO the filesystem block size - ext4 with it's
bigalloc hack, XFS with it's per-inode extent size hints and the
realtime device, etc. Then there's optimal IO characteristics
(e.g. geometery hints like stripe unit/stripe width for the
allocation policy of that given file) that applications could use
if they were present rather than having to expose them through
ioctls that nobody even knows about...

Perhaps also exposing the project ID for quota purposes, like we do
UID and GID. That way we wouldn't need a filesystem specific ioctl
to read it....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-04-27 03:08:40

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On 2012-04-26, at 10:52, David Howells <[email protected]> wrote:

> Steve French <[email protected]> wrote:
>>
>> Both NFS and CIFS (and SMB2) can return inode numbers or equivalent unique
>> identifier, but in the case of CIFS some old servers don't support the calls
>> which return inode numbers (or don't return them for all file system types,
>> Windows FAT?) so in these cases cifs has to create inode numbers on the fly
>> on the client. inode numbers created on the client are not "stable" they can
>> change on unmount/remount (which can cause problems for backup applications).
>
> In the volatile case you'd probably want to unset XSTAT_INO in st_mask as the
> inode number is a local fabrication.

I'd agree. Why fake up an inode number if the application doesn't care? Most apps don't actually use the inode. The only uses I know for the inode number in userspace are backup, CIFS/NFS servers, and "ls -li" .

> However, since there is a remote file ID,
> we could add an XSTAT_INFO_FILE_ID flag to indicate there's a standard xattr
> holding this.

It is a bit strange that the kernel would return a flag that was not requested, but not fatal.

> On CIFS this could be the servername + pathname, on NFS this
> could be the server address + FH on AFS the cell+volID+FID+uniquifier for
> example. That's independent of xstat, however, and wouldn't be returned as
> it's a blob that could be quite large.
>
> I presume in some cases, there is not a unique file ID that persists across
> rename.
>
>> Similarly NFSv4 does not require that servers always return stable inode
>> numbers (that will never change) and introduced a concept of "volatile file
>> handle."
>
> Can I presume the inode number cannot be considered stable if the NFS4 FH is
> non-volatile? Furthermore, can I presume NFS2/3 inode numbers are supposed to
> be stable?
>
>> Basically the question is whether it is worth reporting a flag on the call
>> which returns the inode number to indicate that the inode number is "stable"
>> (would not change on reboot or reconnection) or "volatile." Since the
>> majority of NFS and SMB2 servers can return stable inode numbers, I don't
>> feel strongly about the need for an indicator of "stable" vs. "volatile" but
>> I mention it because backup and migration applications mention this (if inode
>> numbers are volatile, they may have to check for hardlinks differently for
>> example)
>
> It may be that unsetting XSTAT_INO if you've fabricated the inode number
> locally is sufficient.
>
>>>>> Handle remote filesystems being offline and indicate this with
>>>>> XSTAT_INFO_OFFLINE.
>>>>
>>>> You already have support for an indicator for offline files (HSM),
>
> Which indicator is this? Or do you mean XSTAT_INFO_OFFLINE?
>
>>>> would XSTAT_INFO_OFFLINE be intended for the case
>>>> where the network session to the server is disconnected
>>>> (and in which you case the application does not want to reconnect)?
>>>
>>> Hmmm... Interesting question. Both NTFS and CIFS have an offline
>>> attribute (which is where I originally got this from) - but should I have a
>>> separate indicator to indicate the client can't access a server over a
>>> network (ie. we've gone to disconnected operation on this file)?
>>> E.g. should there be a XSTAT_INFO_DISCONNECTED too?
>>
>> my reaction is no, since it adds complexity. If you do a stat on a
>> disconnected volume (where the network is temporarily down) reconnection will
>> be attempted. If reconnection fails then the xstat will either fail or be
>> retried forever depending on the value of "hard" vs. "soft" mount flag.
>
> I was thinking of how to handle disconnected operation, where you can't just
> sit there and churn waiting for the server to come back or give an error. On
> the other hand, as long as there's some spare space in the struct, we can deal
> with that later when we actually start to implement D/O.
>
> David
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-04-27 09:19:09

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Andreas Dilger <[email protected]> wrote:

> > However, since there is a remote file ID, we could add an
> > XSTAT_INFO_FILE_ID flag to indicate there's a standard xattr holding this.
>
> It is a bit strange that the kernel would return a flag that was not
> requested, but not fatal.

On the other hand, if it costs the kernel nothing to generate this indicator...

For network filesystems, for instance, you might know that there is some other
file ID because of the way the fs is specified.

I was thinking that many of the fields will be given values, even if you don't
ask for them, because the values (or approximations thereof) are present in
memory already.

David

2012-04-27 09:39:05

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Dave Chinner <[email protected]> wrote:

> If we are adding per-inode flags, then what do we do with filesystem specific
> flags? e.g. XFS has quite a number of per-inode flags that don't align with
> any other filesystem (e.g. filestream allocator, real time file, behaviour
> inheritence flags, etc), but may be useful to retrieve in such a call. We
> currently have an ioctl to get that information from each inode. Have you
> thought about how to handle such flags?

I haven't looked at XFS with regard to xstat as yet, so I'm not sure exactly
which flags you're talking about. The question, though, is what will actually
make use of these flags? Will it just be XFS tools or are they something that
a GUI might make use of?

Either you can add some of them to the ioc flags (which may be impractical, I
grant you) or we'd have to add an arbitrary fs-type specific field and specify
the host fs (the provision of which might not be a bad idea in and of itself)
to tell userspace how to interpret them.

> Along the same lines, filesytsems can have different allocation constraints
> to IO the filesystem block size - ext4 with it's bigalloc hack, XFS with it's
> per-inode extent size hints and the realtime device, etc. Then there's
> optimal IO characteristics (e.g. geometery hints like stripe unit/stripe
> width for the allocation policy of that given file) that applications could
> use if they were present rather than having to expose them through ioctls
> that nobody even knows about...

Yeah... Not representable by one number. You'd have to unset a flag to say
you were providing this information.

However, providing a whole bunch of hints about I/O characteristics is probably
beyond this syscall - especially if it isn't constant over the length of a
file. That's specialist knowledge that most applications don't need to know.
Having a generic way to retrieve it, though, may be a good idea.

OTOH, there's plenty of uncommitted space, so if we can condense the hints down
to something small, we could perhaps add it later - but from your paragraph
above, it doesn't sound like it'll be small.

> Perhaps also exposing the project ID for quota purposes, like we do UID and
> GID. That way we wouldn't need a filesystem specific ioctl to read it....

Is this an XFS only thing? If so, can it be generalised?

David

2012-04-27 13:13:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Fri, Apr 27, 2012 at 10:39:05AM +0100, David Howells wrote:
> Dave Chinner <[email protected]> wrote:
>
> > If we are adding per-inode flags, then what do we do with filesystem specific
> > flags? e.g. XFS has quite a number of per-inode flags that don't align with
> > any other filesystem (e.g. filestream allocator, real time file, behaviour
> > inheritence flags, etc), but may be useful to retrieve in such a call. We
> > currently have an ioctl to get that information from each inode. Have you
> > thought about how to handle such flags?
>
> I haven't looked at XFS with regard to xstat as yet, so I'm not sure exactly
> which flags you're talking about. The question, though, is what will actually
> make use of these flags? Will it just be XFS tools or are they something that
> a GUI might make use of?

Have a look at fs/xfs/xfs_dinode.h. There's a bunch of flags defined
at the bottom of the file.

Stuff like the "nodefrag", "nodump", and "prealloc" bits seem fairly
generic - they are for indicating that files are to be avoided for defrag or
backup purposes, the prealloc bit indicates that fallocate has been
used to reserve space on the inode (finding files that space can be
punched out of safely), and so on.

Currently these things are queried and manipulated by ioctls
(XFS_IOC_FSX[GS]ETATTR) along with extent size hints, project
quotas, etc. but I think there's some wider use for many of the
flags, which is why I was asking is there's any thought to this sort
of flag being exposed by the VFS.

Historically the flags exposed by the VFS are those used by extN - I
see little reason why we should favour one filesystem's flags over
any others in an extended stat interface if they are generically
useful....

> Either you can add some of them to the ioc flags (which may be impractical, I
> grant you) or we'd have to add an arbitrary fs-type specific field and specify
> the host fs (the provision of which might not be a bad idea in and of itself)
> to tell userspace how to interpret them.

Well, that's the complexity, isn't it. I have no good answer to
that...

> > Along the same lines, filesytsems can have different allocation constraints
> > to IO the filesystem block size - ext4 with it's bigalloc hack, XFS with it's
> > per-inode extent size hints and the realtime device, etc. Then there's
> > optimal IO characteristics (e.g. geometery hints like stripe unit/stripe
> > width for the allocation policy of that given file) that applications could
> > use if they were present rather than having to expose them through ioctls
> > that nobody even knows about...
>
> Yeah... Not representable by one number. You'd have to unset a flag to say
> you were providing this information.
>
> However, providing a whole bunch of hints about I/O characteristics is probably
> beyond this syscall - especially if it isn't constant over the length of a
> file. That's specialist knowledge that most applications don't need to know.
> Having a generic way to retrieve it, though, may be a good idea.

We're continually talking about applications giving us usage hints
on what IO they are going to do so the storage can optimise the IO.
IO is still a GIGO problem, though, and the idea of geometry hints is to enable
us to tell the application to do well formed IO. i.e. less garbage.

XFS has ioctls to expose filesystem geometry, optimal IO sizes, the
alignment limits for direct IO, etc, and they are very useful to
applications that care about high performance IO. A lot of this can
be distilled down to a simple set of geometries, and generally
speaking they don't change mid way through a file....

> OTOH, there's plenty of uncommitted space, so if we can condense the hints down
> to something small, we could perhaps add it later - but from your paragraph
> above, it doesn't sound like it'll be small.

Allocation block size, minimum sane IO size (to avoid page cache RMW
cycles or DIO zeroing), minimum prefered IO size (e.g. stripe unit),
optimal IO size for bandwidth (e.g. stripe width). I don't think
there's much more than that which will be really usable by
applications.

> > Perhaps also exposing the project ID for quota purposes, like we do UID and
> > GID. That way we wouldn't need a filesystem specific ioctl to read it....
>
> Is this an XFS only thing? If so, can it be generalised?

Right now it is, but there's ben patches in the past to introduce
project quotas to ext4. That didn't go far because it was done in a
way that was semantically different to XFS (for no reason that I
could understand) and nobody wanted two different sets of semantics
for the "same" feature. The most common use of project quotas is to
implement sub-tree quotas, which is probably of more interest to
btrfs folks as it is an exact match for per-subvolume quotas.

So, yes, I do see it as something generically useful - it's a
feature that a lot of people use XFS specifically for....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-04-27 15:11:25

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Fri, Apr 27, 2012 at 11:13:06PM +1000, Dave Chinner wrote:
> Right now it is, but there's ben patches in the past to introduce
> project quotas to ext4. That didn't go far because it was done in a
> way that was semantically different to XFS (for no reason that I
> could understand) and nobody wanted two different sets of semantics
> for the "same" feature. The most common use of project quotas is to
> implement sub-tree quotas,

(Though it's also useful as a way to do safe subtree NFS exports).

--b.

> which is probably of more interest to
> btrfs folks as it is an exact match for per-subvolume quotas.
>
> So, yes, I do see it as something generically useful - it's a
> feature that a lot of people use XFS specifically for....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-04-27 16:32:46

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Fri, Apr 27, 2012 at 10:10 AM, J. Bruce Fields <[email protected]> wrote:
> On Fri, Apr 27, 2012 at 11:13:06PM +1000, Dave Chinner wrote:
>> Right now it is, but there's ben patches in the past to introduce
>> project quotas to ext4. That didn't go far because it was done in a
>> way that was semantically different to XFS (for no reason that I
>> could understand) and nobody wanted two different sets of semantics
>> for the "same" feature. ?The most common use of project quotas is to
>> implement sub-tree quotas,
>
> (Though it's also useful as a way to do safe subtree NFS exports).

Quotas are very important to Samba (and Windows) admins.

See e.g.
http://www.samba.org/samba/docs/man/manpages-3/smbcquotas.1.html

But I would defer to Metze and others on the server side to see
whether XFS or the proposed ext4 quotas match the protocol
requirements which Samba has to support.

Does anyone know whether Samba can handle all of the remote quota
admin operations on Linux/XFS today (and local enforcement)?

--
Thanks,

Steve

2012-04-27 18:43:57

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On 2012-04-26, at 7:06 PM, Dave Chinner wrote:
> On Thu, Apr 19, 2012 at 03:05:58PM +0100, David Howells wrote:
>>
>> Implement a pair of new system calls to provide extended and further extensible stat functions.
>>
>> The second of the associated patches is the main patch that provides these new system calls:
>>
>> ssize_t ret = xstat(int dfd,
>> const char *filename,
>> unsigned atflag,
>> unsigned mask,
>> struct xstat *buffer);
>>
>> ssize_t ret = fxstat(int fd,
>> unsigned atflag,
>> unsigned mask,
>> struct xstat *buffer);
>>
>> which are more fully documented in the first patch's description.
>>
>> These new stat functions provide a number of useful features, in summary:
>>
>> (1) More information: creation time, inode generation number, data
>> version number, flags/attributes. A subset of these is available
>> through a number of filesystems (CIFS, NFS, AFS, Ext4 and BTRFS).
>
> If we are adding per-inode flags, then what do we do with filesystem
> specific flags? e.g. XFS has quite a number of per-inode flags that
> don't align with any other filesystem (e.g. filestream allocator,
> real time file, behaviour inheritence flags, etc), but may be useful
> to retrieve in such a call. We currently have an ioctl to get that
> information from each inode. Have you thought about how to handle
> such flags?

I'm sympathetic to your cause, but I don't want this to degrade into
the same morass that it did last time when every attribute under the
sun was added to the call. The intent is to replace the stat() call
with something that can avoid overhead on filesystems for which some
attributes are expensive, and that applications may not need. Some
common attributes were added that are used by multiple filesystems.

If it is too filesystem-specific, and there is little possibility
that these attributes will be usable on other filesystems, then it
should remain a filesystem specific ioctl() call. If you can make
a case that these attributes have value on a few other filesystems,
and applications are reasonably likely to be able to use them, and
their addition does not make the API overly complex, then suggest
away.

> Along the same lines, filesytsems can have different allocation
> constraints to IO the filesystem block size - ext4 with it's
> bigalloc hack, XFS with it's per-inode extent size hints and the
> realtime device, etc. Then there's optimal IO characteristics
> (e.g. geometery hints like stripe unit/stripe width for the
> allocation policy of that given file) that applications could use
> if they were present rather than having to expose them through
> ioctls that nobody even knows about...

There is already "optimal IO size" that the application can use,
how do the geometry hints differ? Userspace is able to handle
st_blksize of several MB in size without problems, and any sane
application will do the IO sized + aligned on multiples of this.

> Perhaps also exposing the project ID for quota purposes, like we do
> UID and GID. That way we wouldn't need a filesystem specific ioctl
> to read it....

This seems reasonable and generic and simple. This is similar to
directory quotas in other filesystems.


Cheers, Andreas






2012-04-27 19:31:07

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On 2012-04-27, at 7:13 AM, Dave Chinner wrote:
> Have a look at fs/xfs/xfs_dinode.h. There's a bunch of flags defined
> at the bottom of the file.
>
> Stuff like the "nodefrag", "nodump", and "prealloc" bits seem fairly
> generic - they are for indicating that files are to be avoided for
> defrag or backup purposes, the prealloc bit indicates that fallocate
> has been used to reserve space on the inode (finding files that space
> can be punched out of safely), and so on.

There is already the FS_NODUMP_FL in the standard FS_IOC_GETFLAGS ioctl
and I expect this to be in statxat() also. In ext4 there was also an
EXT4_EOFBLOCKS_FL added for inodes with fallocate'd data beyond EOF,
but Eric thought it was a pain to maintain and it has been deprecated
in ext4 and e2fsprogs recently.

> Currently these things are queried and manipulated by ioctls
> (XFS_IOC_FSX[GS]ETATTR) along with extent size hints, project
> quotas, etc. but I think there's some wider use for many of the
> flags, which is why I was asking is there's any thought to this sort
> of flag being exposed by the VFS.
>
> Historically the flags exposed by the VFS are those used by extN - I
> see little reason why we should favour one filesystem's flags over
> any others in an extended stat interface if they are generically
> useful....

Sure, they started as ext4 flags because the "lsattr" and "chattr"
tools were using this ioctl/flags, but have become more generic in
recent years. FS_NOTAIL_FL was added for Reiserfs, and FS_NOCOW_FL
was added for another filesystem (maybe Btrfs?). I'm not against
adding more flags here that are generically useful, and recommended
that statxat() have a 64-bit st_ioc_flags, since there are already
22 FS_*_FL flags defined today.

>> Either you can add some of them to the ioc flags (which may be
>> impractical, I grant you) or we'd have to add an arbitrary fs-type
>> specific field and specify the host fs (the provision of which might
>> not be a bad idea in and of itself) to tell userspace how to interpret them.
>
> Well, that's the complexity, isn't it. I have no good answer to
> that...
>
>>> Along the same lines, filesytsems can have different allocation
>>> constraints to IO the filesystem block size - ext4 with it's
>>> bigalloc hack, XFS with it's per-inode extent size hints and the
>>> realtime device, etc. Then there's optimal IO characteristics
>>> (e.g. geometery hints like stripe unit/stripe width for the
>>> allocation policy of that given file) that applications could
>>> use if they were present rather than having to expose them
>>> through ioctls that nobody even knows about...
>>
>> Yeah... Not representable by one number. You'd have to unset a
>> flag to say you were providing this information.
>>
>> However, providing a whole bunch of hints about I/O characteristics
>> is probably beyond this syscall - especially if it isn't constant
>> over the length of a file. That's specialist knowledge that most
>> applications don't need to know.
>> Having a generic way to retrieve it, though, may be a good idea.
>
> We're continually talking about applications giving us usage hints
> on what IO they are going to do so the storage can optimise the IO.
> IO is still a GIGO problem, though, and the idea of geometry hints
> is to enable us to tell the application to do well formed IO. i.e.
> less garbage.
>
> XFS has ioctls to expose filesystem geometry, optimal IO sizes, the
> alignment limits for direct IO, etc, and they are very useful to
> applications that care about high performance IO. A lot of this can
> be distilled down to a simple set of geometries, and generally
> speaking they don't change mid way through a file....
>
>> OTOH, there's plenty of uncommitted space, so if we can condense
>> the hints down to something small, we could perhaps add it later -
>> but from your paragraph above, it doesn't sound like it'll be small.
>
> Allocation block size, minimum sane IO size (to avoid page cache RMW
> cycles or DIO zeroing), minimum prefered IO size (e.g. stripe unit),
> optimal IO size for bandwidth (e.g. stripe width). I don't think
> there's much more than that which will be really usable by
> applications.

I think this is a minimal set that makes sense, and is manageable for
both the interface and for users. Even if it isn't 100% correct for
every file of every filesystem, it still makes sense for many systems.
I'd suggest st_frsize (like BSD statvfs() f_frsize) would be the
minimum fragment or page size, st_iosize (BSD f_iosize) could be
the optimal IO size, and "st_stripesize" for the minimum preferred RAID/chunk size.

One could argue that "st_blksize" is used for the "optimal IO size"
on Linux today, but this is an overloaded term. It _appears_ to
represent the filesystem blocksize, which it usually is not, and on
BSD st_bsize means the minimum blocksize and has a confusingly
similar name. Since any application using this API needs to do some
extra coding already, we may as well give the structure members good
names that are not ambiguous.

>>> Perhaps also exposing the project ID for quota purposes, like we
>>> do UID and GID. That way we wouldn't need a filesystem specific
>>> ioctl to read it....
>>
>> Is this an XFS only thing? If so, can it be generalised?
>
> Right now it is, but there's been patches in the past to introduce
> project quotas to ext4. That didn't go far because it was done in a
> way that was semantically different to XFS (for no reason that I
> could understand) and nobody wanted two different sets of semantics
> for the "same" feature. The most common use of project quotas is to
> implement sub-tree quotas, which is probably of more interest to
> btrfs folks as it is an exact match for per-subvolume quotas.
>
> So, yes, I do see it as something generically useful - it's a
> feature that a lot of people use XFS specifically for....

I'd agree. There was the tree quota project for ext4, and I've also
heard this is available in other filesystems.

Cheers, Andreas






2012-04-27 23:54:41

by Paul Eggert

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On 04/26/2012 11:25 AM, Roland McGrath wrote:
> What I'd really prefer is a name that is less meaninglessly arcane,

Since people are used to the string "stat", and since this is
about getting related attributes, how about using the string "statr"?
Thus "statrat" could be the syscall that acts like fstatat, but gets
related attributes too.

A bonus of this name is that a "stat rat" is slang
for someone who loves looking at statistics, often
sports statistics. Billy Beane is a stat rat.
Someone who cares about files' statistics is also
a stat rat.

The "r" in "statrat" could stand for "related", or
for "relevant", or for whatever you like.

2012-04-28 00:38:33

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Thu, Apr 26, 2012 at 09:22:04PM -0600, Andreas Dilger wrote:
> On 2012-04-26, at 7:06 PM, Dave Chinner wrote:
> > On Thu, Apr 19, 2012 at 03:05:58PM +0100, David Howells wrote:
> >>
> >> Implement a pair of new system calls to provide extended and further extensible stat functions.
> >>
> >> The second of the associated patches is the main patch that provides these new system calls:
> >>
> >> ssize_t ret = xstat(int dfd,
> >> const char *filename,
> >> unsigned atflag,
> >> unsigned mask,
> >> struct xstat *buffer);
> >>
> >> ssize_t ret = fxstat(int fd,
> >> unsigned atflag,
> >> unsigned mask,
> >> struct xstat *buffer);
> >>
> >> which are more fully documented in the first patch's description.
> >>
> >> These new stat functions provide a number of useful features, in summary:
> >>
> >> (1) More information: creation time, inode generation number, data
> >> version number, flags/attributes. A subset of these is available
> >> through a number of filesystems (CIFS, NFS, AFS, Ext4 and BTRFS).
> >
> > If we are adding per-inode flags, then what do we do with filesystem
> > specific flags? e.g. XFS has quite a number of per-inode flags that
> > don't align with any other filesystem (e.g. filestream allocator,
> > real time file, behaviour inheritence flags, etc), but may be useful
> > to retrieve in such a call. We currently have an ioctl to get that
> > information from each inode. Have you thought about how to handle
> > such flags?
>
> I'm sympathetic to your cause, but I don't want this to degrade into
> the same morass that it did last time when every attribute under the
> sun was added to the call.

Understood, which is why I'm not asking for everything under the sun
to be supported. I'm more interested in finding the useful subset of
information that a typical application might make use of.

> The intent is to replace the stat() call
> with something that can avoid overhead on filesystems for which some
> attributes are expensive, and that applications may not need. Some
> common attributes were added that are used by multiple filesystems.
>
> If it is too filesystem-specific, and there is little possibility
> that these attributes will be usable on other filesystems, then it
> should remain a filesystem specific ioctl() call.

Right, that's why I didn't mention the real-time bits, the
filestream allocation bits, or other things that are tightly bound
to the way XFS works....

> If you can make
> a case that these attributes have value on a few other filesystems,
> and applications are reasonably likely to be able to use them, and
> their addition does not make the API overly complex, then suggest
> away.

Exactly my thoughts ;)

> > Along the same lines, filesytsems can have different allocation
> > constraints to IO the filesystem block size - ext4 with it's
> > bigalloc hack, XFS with it's per-inode extent size hints and the
> > realtime device, etc. Then there's optimal IO characteristics
> > (e.g. geometery hints like stripe unit/stripe width for the
> > allocation policy of that given file) that applications could use
> > if they were present rather than having to expose them through
> > ioctls that nobody even knows about...
>
> There is already "optimal IO size" that the application can use,
> how do the geometry hints differ?

Have a look at how XFS overloads stat.st_blksize depending on the
filesystem and inode config. It's amazingly convoluted, and based on
a combination of filesystem geometry, inode bits and mount options:

xfs_vn_getattr()
....
if (XFS_IS_REALTIME_INODE(ip)) {
/*
* If the file blocks are being allocated from a
* realtime volume, then return the inode's realtime
* extent size or the realtime volume's extent size.
*/
stat->blksize =
xfs_get_extsz_hint(ip) << mp->m_sb.sb_blocklog;
} else
stat->blksize = xfs_preferred_iosize(mp);
......

xfs_extlen_t
xfs_get_extsz_hint(
struct xfs_inode *ip)
{
if ((ip->i_d.di_flags & XFS_DIFLAG_EXTSIZE) && ip->i_d.di_extsize)
return ip->i_d.di_extsize;
if (XFS_IS_REALTIME_INODE(ip))
return ip->i_mount->m_sb.sb_rextsize;
return 0;
}

....

static inline unsigned long
xfs_preferred_iosize(xfs_mount_t *mp)
{
if (mp->m_flags & XFS_MOUNT_COMPAT_IOSIZE)
return PAGE_CACHE_SIZE;
return (mp->m_swidth ?
(mp->m_swidth << mp->m_sb.sb_blocklog) :
((mp->m_flags & XFS_MOUNT_DFLT_IOSIZE) ?
(1 << (int)MAX(mp->m_readio_log, mp->m_writeio_log)) :
PAGE_CACHE_SIZE));
}

All of that can be exported as 4 parameters for normal files:

allocation block size (extent size hint)
minimum io size (PAGE_CACHE_SIZE)
preferred minimum IO size (mp->m_readio_log/mp->m_writeio_log)
best aligned IO size (stripe width)

And for realtime files it's a bit different because of the
block-based bitmap allocator it uses:

allocation block size (extent size hint)
minimum io size (PAGE_CACHE_SIZE)
preferred minimum IO size (extent size hint)
best aligned IO size (some multiple of extent size hint)

> Userspace is able to handle
> st_blksize of several MB in size without problems, and any sane
> application will do the IO sized + aligned on multiples of this.

Actually, some applications still have problems with that. That's
the reason we only expose stripe widths in st_blksize when a mount
option is set. Stripe widths are known to get into the tens of MB,
and applications using st_blksize for memory allocation of IO
buffers tend to get into trouble with those.

That's why I'd prefer specific optimal IO hints - we don't have to
overload st_blksize with lots of meanings to pass what is relatively
trivial information back to the application.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-04-28 00:54:13

by Steve French

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Fri, Apr 27, 2012 at 7:38 PM, Dave Chinner <[email protected]> wrote:
> preferred minimum IO size (mp->m_readio_log/mp->m_writeio_log)

This discussion about i/o sizes is very interesting. For network
file system (at least for SMB2 to all known servers, and
for cifs mounts to Samba, but probably for recent NFS), ideal i/o
sizes are often well over a megabyte ... but how to indicate that to
the application...

> That's why I'd prefer specific optimal IO hints - we don't have to
> overload st_blksize with lots of meanings to pass what is relatively
> trivial information back to the application.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner

Yes.




--
Thanks,

Steve

2012-04-28 00:58:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

On Fri, Apr 27, 2012 at 01:31:07PM -0600, Andreas Dilger wrote:
> On 2012-04-27, at 7:13 AM, Dave Chinner wrote:
> > Have a look at fs/xfs/xfs_dinode.h. There's a bunch of flags defined
> > at the bottom of the file.
> >
> > Stuff like the "nodefrag", "nodump", and "prealloc" bits seem fairly
> > generic - they are for indicating that files are to be avoided for
> > defrag or backup purposes, the prealloc bit indicates that fallocate
> > has been used to reserve space on the inode (finding files that space
> > can be punched out of safely), and so on.
>
> There is already the FS_NODUMP_FL in the standard FS_IOC_GETFLAGS ioctl
> and I expect this to be in statxat() also.

I forgot that was one of the generic flags :/

> In ext4 there was also an
> EXT4_EOFBLOCKS_FL added for inodes with fallocate'd data beyond EOF,
> but Eric thought it was a pain to maintain and it has been deprecated
> in ext4 and e2fsprogs recently.

I'd think that flag is more of a "filesystem implementation
specific" flag than a general "this file contained persistent
preallocation" flag, which is essentially what the XFS flag says.
XFS uses in various ways to optimise extent management on the file
(e.g. don't truncate extents past EOF when closing the file), but it
is not specific to one particular aspect of the preallocation
implementation.

> >> OTOH, there's plenty of uncommitted space, so if we can condense
> >> the hints down to something small, we could perhaps add it later -
> >> but from your paragraph above, it doesn't sound like it'll be small.
> >
> > Allocation block size, minimum sane IO size (to avoid page cache RMW
> > cycles or DIO zeroing), minimum prefered IO size (e.g. stripe unit),
> > optimal IO size for bandwidth (e.g. stripe width). I don't think
> > there's much more than that which will be really usable by
> > applications.
>
> I think this is a minimal set that makes sense, and is manageable for
> both the interface and for users. Even if it isn't 100% correct for
> every file of every filesystem, it still makes sense for many systems.

That's the aim, isn't it? To expose what is useful to the majority
in a simple manner?

> I'd suggest st_frsize (like BSD statvfs() f_frsize) would be the
> minimum fragment or page size, st_iosize (BSD f_iosize) could be
> the optimal IO size, and "st_stripesize" for the minimum preferred RAID/chunk size.

Personally, I think those names are, well, terribly lacking in
obviousness. Something more along the lines of:

st_blksize - file block size
st_alloc_blksize - allocation block size/alignment
st_small_io_size - IO size/alignment that avoids
filesystem/page cache RMW
st_preferred_io_size - preferred IO size for general
usage.
st_large_io_size - IO size/alignment for high
bandwidth sequential IO

With the aim that applications tend to use st_preferred_io_size for
all general IO (i.e. the default), st_small_io_size for small IO,
IOPS intensive workloads, and st_large_io_size for writing large
chunks of sequential data.

> One could argue that "st_blksize" is used for the "optimal IO size"
> on Linux today, but this is an overloaded term. It _appears_ to
> represent the filesystem blocksize, which it usually is not, and on
> BSD st_bsize means the minimum blocksize and has a confusingly
> similar name. Since any application using this API needs to do some
> extra coding already, we may as well give the structure members good
> names that are not ambiguous.

Well said - I couldn't have stated the case better myself. ;)

Cheers,

Dave.

--
Dave Chinner
[email protected]

2012-05-10 09:51:38

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 0/6] Extended file stat system call

Dave Chinner <[email protected]> wrote:

> st_blksize - file block size
> st_alloc_blksize - allocation block size/alignment
> st_small_io_size - IO size/alignment that avoids
> filesystem/page cache RMW
> st_preferred_io_size - preferred IO size for general
> usage.
> st_large_io_size - IO size/alignment for high
> bandwidth sequential IO

What is st_blksize here? Is it directly comparable (if such a thing is
possible) to st_blksize in struct stat? Or does st_preferred_io_size map to
the current st_blksize and your st_blksize map to actual media block size (and
if so, is that the same as st_alloc_blksize? - I presume st_alloc_blksize
must be a multiple of the media block size).

David