2012-05-08 20:19:42

by David Howells

[permalink] [raw]
Subject: Extended file stat: Splitting file- and fs-specific info?


Should I split the file-specific info and the fs-specific info and make the
second optional? What I'm thinking of is something like this:

Have a file information structure:

struct statx {
/* 0x00 */
uint32_t st_mask; /* What results were written */
uint32_t st_information; /* Information about the file */
uint16_t st_mode; /* File mode */
uint16_t __spare0[3];
/* 0x10 */
uint32_t st_uid; /* User ID of owner */
uint32_t st_gid; /* Group ID of owner */
uint32_t st_nlink; /* Number of hard links */
uint32_t st_blksize; /* Optimal size for filesystem I/O */
/* 0x20 */
struct statx_dev st_rdev; /* Device ID of special file */
struct statx_dev st_dev; /* ID of device containing file */
/* 0x30 */
int32_t st_atime_ns; /* Last access time (ns part) */
int32_t st_btime_ns; /* File creation time (ns part) */
int32_t st_ctime_ns; /* Last attribute change time (ns part) */
int32_t st_mtime_ns; /* Last data modification time (ns part) */
/* 0x40 */
int64_t st_atime; /* Last access time */
int64_t st_btime; /* File creation time */
int64_t st_ctime; /* Last attribute change time */
int64_t st_mtime; /* Last data modification time */
/* 0x60 */
uint64_t st_ino; /* Inode number */
uint64_t st_size; /* File size */
uint64_t st_blocks; /* Number of 512-byte blocks allocated */
uint64_t st_gen; /* Inode generation number */
uint64_t st_version; /* Data version number */
uint64_t st_ioc_flags; /* As FS_IOC_GETFLAGS */
/* 0x90 */
uint64_t __spare1[13]; /* Spare space for future expansion */
/* 0x100 */
};

And an fs information structure for less commonly needed data:

struct statx_fsinfo {
/* 0x00 - General info */
uint32_t st_mask; /* What optional fields are filled in */
uint32_t st_type; /* Filesystem type from linux/magic.h */

/* 0x08 - file timestamp granularity info */
uint16_t st_atime_gran_mantissa; /* gran(secs) = mant * 10^exp */
uint16_t st_btime_gran_mantissa;
uint16_t st_ctime_gran_mantissa;
uint16_t st_mtime_gran_mantissa;
/* 0x10 */
int8_t st_atime_gran_exponent;
int8_t st_btime_gran_exponent;
int8_t st_ctime_gran_exponent;
int8_t st_mtime_gran_exponent;

/* 0x14 - I/O parameters */
uint32_t st_blksize; /* File block size */
uint32_t st_alloc_blksize; /* Allocation block size/alignment */
uint32_t st_small_io_size; /* IO size/alignment that avoids fs/page cache RMW */
uint32_t st_pref_io_size; /* Preferred IO size for general usage */
uint32_t st_large_io_size; /* IO size/alignment for high bandwidth sequential IO */

/* 0x28 - Restrictions on struct statx contents */
uint64_t st_supported_ioc_flags; /* FS_IOC_GETFLAGS flags supported */

/* 0x30 - Volume/filesystem information */
uint64_t st_fsid; /* Short 64-bit Filesystem ID (as statfs) */
uint64_t __spare0[3];
/* 0x50 */
uint8_t st_volume_id[16]; /* Volume/fs identifier */
uint8_t st_volume_uuid[16]; /* Volume/fs UUID */
/* 0x80 */
uint64_t __spare1[8];
/* 0xc0 */
uint8_t st_volume_name[64]; /* Volume name (up to 64 chars) */
/* 0x100 */
uint8_t st_domain_name[256]; /* Domain/cell/workgroup name (up to 256 chars) */
/* 0x200 */
};

One could argue a bit over what goes in which, should we go for this. This
may be better split between multiple syscalls though (with the race that that
implies) and potentially merging with statfs.


The statxat() syscall [née xstat] could then use the 6th parameter thusly:

asmlinkage long sys_statxat(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer,
struct statx_fsinfo __user *fsinfo);


letting fsinfo be NULL to indicate a lack of interest. I'm not sure we want
to do that, though.


Also, do Dave Chinner's ideas for indicating five I/O parameters want to be
32-bit numbers? Larger? Smaller? Can they be log2?

Note also, that I've suggested that we represent the timestamp granularity
information as a decimal float (which requires 3 bytes per timestamp) and that
we provide separate granularities for each timestamp.

David


2012-05-08 21:13:21

by Myklebust, Trond

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On Tue, 2012-05-08 at 21:19 +0100, David Howells wrote:
> Should I split the file-specific info and the fs-specific info and make the
> second optional? What I'm thinking of is something like this:
>
> Have a file information structure:
>
> struct statx {
> /* 0x00 */
> uint32_t st_mask; /* What results were written */
> uint32_t st_information; /* Information about the file */
> uint16_t st_mode; /* File mode */
> uint16_t __spare0[3];
> /* 0x10 */
> uint32_t st_uid; /* User ID of owner */
> uint32_t st_gid; /* Group ID of owner */
> uint32_t st_nlink; /* Number of hard links */
> uint32_t st_blksize; /* Optimal size for filesystem I/O */
> /* 0x20 */
> struct statx_dev st_rdev; /* Device ID of special file */
> struct statx_dev st_dev; /* ID of device containing file */
> /* 0x30 */
> int32_t st_atime_ns; /* Last access time (ns part) */
> int32_t st_btime_ns; /* File creation time (ns part) */
> int32_t st_ctime_ns; /* Last attribute change time (ns part) */
> int32_t st_mtime_ns; /* Last data modification time (ns part) */
> /* 0x40 */
> int64_t st_atime; /* Last access time */
> int64_t st_btime; /* File creation time */
> int64_t st_ctime; /* Last attribute change time */
> int64_t st_mtime; /* Last data modification time */
> /* 0x60 */
> uint64_t st_ino; /* Inode number */
> uint64_t st_size; /* File size */

Should we consider making the st_size and st_blocks 128-bit values while
we're at it? Alternatively, we could add an st_ioc_flag for it later...

> uint64_t st_blocks; /* Number of 512-byte blocks allocated */
> uint64_t st_gen; /* Inode generation number */
> uint64_t st_version; /* Data version number */
> uint64_t st_ioc_flags; /* As FS_IOC_GETFLAGS */
> /* 0x90 */
> uint64_t __spare1[13]; /* Spare space for future expansion */
> /* 0x100 */
> };
>
> And an fs information structure for less commonly needed data:
>
> struct statx_fsinfo {
> /* 0x00 - General info */
> uint32_t st_mask; /* What optional fields are filled in */
> uint32_t st_type; /* Filesystem type from linux/magic.h */
>
> /* 0x08 - file timestamp granularity info */
> uint16_t st_atime_gran_mantissa; /* gran(secs) = mant * 10^exp */
> uint16_t st_btime_gran_mantissa;
> uint16_t st_ctime_gran_mantissa;
> uint16_t st_mtime_gran_mantissa;
> /* 0x10 */
> int8_t st_atime_gran_exponent;
> int8_t st_btime_gran_exponent;
> int8_t st_ctime_gran_exponent;
> int8_t st_mtime_gran_exponent;
>
> /* 0x14 - I/O parameters */
> uint32_t st_blksize; /* File block size */
> uint32_t st_alloc_blksize; /* Allocation block size/alignment */
> uint32_t st_small_io_size; /* IO size/alignment that avoids fs/page cache RMW */
> uint32_t st_pref_io_size; /* Preferred IO size for general usage */
> uint32_t st_large_io_size; /* IO size/alignment for high bandwidth sequential IO */
>
> /* 0x28 - Restrictions on struct statx contents */
> uint64_t st_supported_ioc_flags; /* FS_IOC_GETFLAGS flags supported */
>
> /* 0x30 - Volume/filesystem information */
> uint64_t st_fsid; /* Short 64-bit Filesystem ID (as statfs) */
> uint64_t __spare0[3];
> /* 0x50 */
> uint8_t st_volume_id[16]; /* Volume/fs identifier */
> uint8_t st_volume_uuid[16]; /* Volume/fs UUID */
> /* 0x80 */
> uint64_t __spare1[8];
> /* 0xc0 */
> uint8_t st_volume_name[64]; /* Volume name (up to 64 chars) */
> /* 0x100 */
> uint8_t st_domain_name[256]; /* Domain/cell/workgroup name (up to 256 chars) */
> /* 0x200 */
> };

If you are making a separate fsinfo structure, then it would be nice to
have flags to indicate what kind of acls the filesystem supports, and if
it supports features such as xattrs, subfiles and/or snapshots.



--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-05-09 00:24:20

by Dave Chinner

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On Tue, May 08, 2012 at 09:19:42PM +0100, David Howells wrote:
>
> Should I split the file-specific info and the fs-specific info and make the
> second optional? What I'm thinking of is something like this:
>
> Have a file information structure:
>
> struct statx {
> /* 0x00 */
> uint32_t st_mask; /* What results were written */
> uint32_t st_information; /* Information about the file */
> uint16_t st_mode; /* File mode */
> uint16_t __spare0[3];
> /* 0x10 */
> uint32_t st_uid; /* User ID of owner */
> uint32_t st_gid; /* Group ID of owner */
> uint32_t st_nlink; /* Number of hard links */
> uint32_t st_blksize; /* Optimal size for filesystem I/O */
> /* 0x20 */
> struct statx_dev st_rdev; /* Device ID of special file */
> struct statx_dev st_dev; /* ID of device containing file */
> /* 0x30 */
> int32_t st_atime_ns; /* Last access time (ns part) */
> int32_t st_btime_ns; /* File creation time (ns part) */
> int32_t st_ctime_ns; /* Last attribute change time (ns part) */
> int32_t st_mtime_ns; /* Last data modification time (ns part) */
> /* 0x40 */
> int64_t st_atime; /* Last access time */
> int64_t st_btime; /* File creation time */
> int64_t st_ctime; /* Last attribute change time */
> int64_t st_mtime; /* Last data modification time */
> /* 0x60 */
> uint64_t st_ino; /* Inode number */
> uint64_t st_size; /* File size */
> uint64_t st_blocks; /* Number of 512-byte blocks allocated */
> uint64_t st_gen; /* Inode generation number */

I don't think we want to expose the inode generation numbers. It is
trivial to construct NFS file handles (usually just fsid, inode
number and generation) with that information and hence bypass
security checks to access files.

> uint64_t st_version; /* Data version number */
> uint64_t st_ioc_flags; /* As FS_IOC_GETFLAGS */
> /* 0x90 */
> uint64_t __spare1[13]; /* Spare space for future expansion */
> /* 0x100 */
> };
>
> And an fs information structure for less commonly needed data:
>
> struct statx_fsinfo {
> /* 0x00 - General info */
> uint32_t st_mask; /* What optional fields are filled in */
> uint32_t st_type; /* Filesystem type from linux/magic.h */
>
> /* 0x08 - file timestamp granularity info */
> uint16_t st_atime_gran_mantissa; /* gran(secs) = mant * 10^exp */
> uint16_t st_btime_gran_mantissa;
> uint16_t st_ctime_gran_mantissa;
> uint16_t st_mtime_gran_mantissa;
> /* 0x10 */
> int8_t st_atime_gran_exponent;
> int8_t st_btime_gran_exponent;
> int8_t st_ctime_gran_exponent;
> int8_t st_mtime_gran_exponent;
>
> /* 0x14 - I/O parameters */
> uint32_t st_blksize; /* File block size */
> uint32_t st_alloc_blksize; /* Allocation block size/alignment */
> uint32_t st_small_io_size; /* IO size/alignment that avoids fs/page cache RMW */
> uint32_t st_pref_io_size; /* Preferred IO size for general usage */
> uint32_t st_large_io_size; /* IO size/alignment for high bandwidth sequential IO */

That's per file information, not per filesystem. XFS definitely
needs this IO information per-file....

>
> /* 0x28 - Restrictions on struct statx contents */
> uint64_t st_supported_ioc_flags; /* FS_IOC_GETFLAGS flags supported */
>
> /* 0x30 - Volume/filesystem information */
> uint64_t st_fsid; /* Short 64-bit Filesystem ID (as statfs) */
> uint64_t __spare0[3];
> /* 0x50 */
> uint8_t st_volume_id[16]; /* Volume/fs identifier */
> uint8_t st_volume_uuid[16]; /* Volume/fs UUID */

And there's all the remaining information needed to construct file
NFS handles without root priviledges...

> /* 0x80 */
> uint64_t __spare1[8];
> /* 0xc0 */
> uint8_t st_volume_name[64]; /* Volume name (up to 64 chars) */
> /* 0x100 */
> uint8_t st_domain_name[256]; /* Domain/cell/workgroup name (up to 256 chars) */
> /* 0x200 */
> };
>
> One could argue a bit over what goes in which, should we go for this. This
> may be better split between multiple syscalls though (with the race that that
> implies) and potentially merging with statfs.
>
>
> The statxat() syscall [n?e xstat] could then use the 6th parameter thusly:
>
> asmlinkage long sys_statxat(int dfd, const char __user *path, unsigned flags,
> unsigned mask, struct statx __user *buffer,
> struct statx_fsinfo __user *fsinfo);
>
>
> letting fsinfo be NULL to indicate a lack of interest. I'm not sure we want
> to do that, though.
>
>
> Also, do Dave Chinner's ideas for indicating five I/O parameters want to be
> 32-bit numbers? Larger? Smaller? Can they be log2?

Definitely 32 bit, IMO, as it's not uncommon to see optimal IO sizes
in the tens of megabytes on large, high bandwidth storage systems.
As for being log2 - that's just making it more complex to use and
making code ugly - we'd have to convert to log2 in kernel, then
convert back in every single application....

> Note also, that I've suggested that we represent the timestamp granularity
> information as a decimal float (which requires 3 bytes per timestamp) and that
> we provide separate granularities for each timestamp.
>
> David
>

--
Dave Chinner
[email protected]

2012-05-09 01:09:41

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On Wed, May 09, 2012 at 10:24:20AM +1000, Dave Chinner wrote:
> On Tue, May 08, 2012 at 09:19:42PM +0100, David Howells wrote:
> >
> > Should I split the file-specific info and the fs-specific info and make the
> > second optional? What I'm thinking of is something like this:
> >
> > Have a file information structure:
> >
> > struct statx {
> > /* 0x00 */
> > uint32_t st_mask; /* What results were written */
> > uint32_t st_information; /* Information about the file */
> > uint16_t st_mode; /* File mode */
> > uint16_t __spare0[3];
> > /* 0x10 */
> > uint32_t st_uid; /* User ID of owner */
> > uint32_t st_gid; /* Group ID of owner */
> > uint32_t st_nlink; /* Number of hard links */
> > uint32_t st_blksize; /* Optimal size for filesystem I/O */
> > /* 0x20 */
> > struct statx_dev st_rdev; /* Device ID of special file */
> > struct statx_dev st_dev; /* ID of device containing file */
> > /* 0x30 */
> > int32_t st_atime_ns; /* Last access time (ns part) */
> > int32_t st_btime_ns; /* File creation time (ns part) */
> > int32_t st_ctime_ns; /* Last attribute change time (ns part) */
> > int32_t st_mtime_ns; /* Last data modification time (ns part) */
> > /* 0x40 */
> > int64_t st_atime; /* Last access time */
> > int64_t st_btime; /* File creation time */
> > int64_t st_ctime; /* Last attribute change time */
> > int64_t st_mtime; /* Last data modification time */
> > /* 0x60 */
> > uint64_t st_ino; /* Inode number */
> > uint64_t st_size; /* File size */
> > uint64_t st_blocks; /* Number of 512-byte blocks allocated */
> > uint64_t st_gen; /* Inode generation number */
>
> I don't think we want to expose the inode generation numbers. It is
> trivial to construct NFS file handles (usually just fsid, inode
> number and generation) with that information and hence bypass
> security checks to access files.

I'm not convinced there's much value in trying to keep filehandles
secret.

If you're going to base your security on a secret, then it should be
hard to guess, easy to keep secret, and changeable in case it ever does
get out.

Filehandles are pretty easy to guess (even without help like this), they
usually go over the wire in cleartext, and they can't be changed.

--b.

>
> > uint64_t st_version; /* Data version number */
> > uint64_t st_ioc_flags; /* As FS_IOC_GETFLAGS */
> > /* 0x90 */
> > uint64_t __spare1[13]; /* Spare space for future expansion */
> > /* 0x100 */
> > };
> >
> > And an fs information structure for less commonly needed data:
> >
> > struct statx_fsinfo {
> > /* 0x00 - General info */
> > uint32_t st_mask; /* What optional fields are filled in */
> > uint32_t st_type; /* Filesystem type from linux/magic.h */
> >
> > /* 0x08 - file timestamp granularity info */
> > uint16_t st_atime_gran_mantissa; /* gran(secs) = mant * 10^exp */
> > uint16_t st_btime_gran_mantissa;
> > uint16_t st_ctime_gran_mantissa;
> > uint16_t st_mtime_gran_mantissa;
> > /* 0x10 */
> > int8_t st_atime_gran_exponent;
> > int8_t st_btime_gran_exponent;
> > int8_t st_ctime_gran_exponent;
> > int8_t st_mtime_gran_exponent;
> >
> > /* 0x14 - I/O parameters */
> > uint32_t st_blksize; /* File block size */
> > uint32_t st_alloc_blksize; /* Allocation block size/alignment */
> > uint32_t st_small_io_size; /* IO size/alignment that avoids fs/page cache RMW */
> > uint32_t st_pref_io_size; /* Preferred IO size for general usage */
> > uint32_t st_large_io_size; /* IO size/alignment for high bandwidth sequential IO */
>
> That's per file information, not per filesystem. XFS definitely
> needs this IO information per-file....
>
> >
> > /* 0x28 - Restrictions on struct statx contents */
> > uint64_t st_supported_ioc_flags; /* FS_IOC_GETFLAGS flags supported */
> >
> > /* 0x30 - Volume/filesystem information */
> > uint64_t st_fsid; /* Short 64-bit Filesystem ID (as statfs) */
> > uint64_t __spare0[3];
> > /* 0x50 */
> > uint8_t st_volume_id[16]; /* Volume/fs identifier */
> > uint8_t st_volume_uuid[16]; /* Volume/fs UUID */
>
> And there's all the remaining information needed to construct file
> NFS handles without root priviledges...
>
> > /* 0x80 */
> > uint64_t __spare1[8];
> > /* 0xc0 */
> > uint8_t st_volume_name[64]; /* Volume name (up to 64 chars) */
> > /* 0x100 */
> > uint8_t st_domain_name[256]; /* Domain/cell/workgroup name (up to 256 chars) */
> > /* 0x200 */
> > };
> >
> > One could argue a bit over what goes in which, should we go for this. This
> > may be better split between multiple syscalls though (with the race that that
> > implies) and potentially merging with statfs.
> >
> >
> > The statxat() syscall [née xstat] could then use the 6th parameter thusly:
> >
> > asmlinkage long sys_statxat(int dfd, const char __user *path, unsigned flags,
> > unsigned mask, struct statx __user *buffer,
> > struct statx_fsinfo __user *fsinfo);
> >
> >
> > letting fsinfo be NULL to indicate a lack of interest. I'm not sure we want
> > to do that, though.
> >
> >
> > Also, do Dave Chinner's ideas for indicating five I/O parameters want to be
> > 32-bit numbers? Larger? Smaller? Can they be log2?
>
> Definitely 32 bit, IMO, as it's not uncommon to see optimal IO sizes
> in the tens of megabytes on large, high bandwidth storage systems.
> As for being log2 - that's just making it more complex to use and
> making code ugly - we'd have to convert to log2 in kernel, then
> convert back in every single application....
>
> > Note also, that I've suggested that we represent the timestamp granularity
> > information as a decimal float (which requires 3 bytes per timestamp) and that
> > we provide separate granularities for each timestamp.
> >
> > David
> >
>
> --
> Dave Chinner
> [email protected]

2012-05-09 01:16:30

by Andreas Dilger

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On 2012-05-08, at 18:24, Dave Chinner <[email protected]> wrote:
> On Tue, May 08, 2012 at 09:19:42PM +0100, David Howells wrote:
>>
>> Should I split the file-specific info and the fs-specific info and make the
>> second optional? What I'm thinking of is something like this:
>>
>> Have a file information structure:
>>
>> struct statx {
>> /* 0x00 */
>> uint32_t st_mask; /* What results were written */
>> uint32_t st_information; /* Information about the file */
>> uint16_t st_mode; /* File mode */
>> uint16_t __spare0[3];
>> /* 0x10 */
>> uint32_t st_uid; /* User ID of owner */
>> uint32_t st_gid; /* Group ID of owner */
>> uint32_t st_nlink; /* Number of hard links */
>> uint32_t st_blksize; /* Optimal size for filesystem I/O */
>> /* 0x20 */
>> struct statx_dev st_rdev; /* Device ID of special file */
>> struct statx_dev st_dev; /* ID of device containing file */
>> /* 0x30 */
>> int32_t st_atime_ns; /* Last access time (ns part) */
>> int32_t st_btime_ns; /* File creation time (ns part) */
>> int32_t st_ctime_ns; /* Last attribute change time (ns part) */
>> int32_t st_mtime_ns; /* Last data modification time (ns part) */
>> /* 0x40 */
>> int64_t st_atime; /* Last access time */
>> int64_t st_btime; /* File creation time */
>> int64_t st_ctime; /* Last attribute change time */
>> int64_t st_mtime; /* Last data modification time */
>> /* 0x60 */
>> uint64_t st_ino; /* Inode number */
>> uint64_t st_size; /* File size */
>> uint64_t st_blocks; /* Number of 512-byte blocks allocated */
>> uint64_t st_gen; /* Inode generation number */
>
> I don't think we want to expose the inode generation numbers. It is
> trivial to construct NFS file handles (usually just fsid, inode
> number and generation) with that information and hence bypass
> security checks to access files.

At the same time, if the user can stay the file and get this information, that isn't making the sole any less secure than if they can access the file via NFS. They have already passed all of the pathname security checks by the time they can do a statxat() on the file.

>> uint64_t st_version; /* Data version number */
>> uint64_t st_ioc_flags; /* As FS_IOC_GETFLAGS */
>> /* 0x90 */
>> uint64_t __spare1[13]; /* Spare space for future expansion */
>> /* 0x100 */
>> };
>>
>> And an fs information structure for less commonly needed data:

One comment on this struct is that it would probably be better to use sx_ or sf_ as the prefix for these fields.

>> struct statx_fsinfo {
>> /* 0x00 - General info */
>> uint32_t st_mask; /* What optional fields are filled in */
>> uint32_t st_type; /* Filesystem type from linux/magic.h */
>>
>> /* 0x08 - file timestamp granularity info */
>> uint16_t st_atime_gran_mantissa; /* gran(secs) = mant * 10^exp */
>> uint16_t st_btime_gran_mantissa;
>> uint16_t st_ctime_gran_mantissa;
>> uint16_t st_mtime_gran_mantissa;
>> /* 0x10 */
>> int8_t st_atime_gran_exponent;
>> int8_t st_btime_gran_exponent;
>> int8_t st_ctime_gran_exponent;
>> int8_t st_mtime_gran_exponent;
>>
>> /* 0x14 - I/O parameters */
>> uint32_t st_blksize; /* File block size */
>> uint32_t st_alloc_blksize; /* Allocation block size/alignment */
>> uint32_t st_small_io_size; /* IO size/alignment that avoids fs/page cache RMW */
>> uint32_t st_pref_io_size; /* Preferred IO size for general usage */
>> uint32_t st_large_io_size; /* IO size/alignment for high bandwidth sequential IO */
>
> That's per file information, not per filesystem. XFS definitely
> needs this IO information per-file....

Definitely. Lustre can have wildly different layouts for each file. This will also avoid duplication of st_blksize in both the statx and statx_fsinfo structs.

The main question I have about these fields is what the difference is between st_blksize and st_alloc_blksize?


>> /* 0x28 - Restrictions on struct statx contents */
>> uint64_t st_supported_ioc_flags; /* FS_IOC_GETFLAGS flags supported */
>>
>> /* 0x30 - Volume/filesystem information */
>> uint64_t st_fsid; /* Short 64-bit Filesystem ID (as statfs) */
>> uint64_t __spare0[3];
>> /* 0x50 */
>> uint8_t st_volume_id[16]; /* Volume/fs identifier */
>> uint8_t st_volume_uuid[16]; /* Volume/fs UUID */
>
> And there's all the remaining information needed to construct file
> NFS handles without root priviledges...
>
>> /* 0x80 */
>> uint64_t __spare1[8];
>> /* 0xc0 */
>> uint8_t st_volume_name[64]; /* Volume name (up to 64 chars) */
>> /* 0x100 */
>> uint8_t st_domain_name[256]; /* Domain/cell/workgroup name (up to 256 chars) */
>> /* 0x200 */
>> };
>>
>> One could argue a bit over what goes in which, should we go for this. This
>> may be better split between multiple syscalls though (with the race that that
>> implies) and potentially merging with statfs.
>>
>>
>> The statxat() syscall [née xstat] could then use the 6th parameter thusly:
>>
>> asmlinkage long sys_statxat(int dfd, const char __user *path, unsigned flags,
>> unsigned mask, struct statx __user *buffer,
>> struct statx_fsinfo __user *fsinfo);
>>
>>
>> letting fsinfo be NULL to indicate a lack of interest. I'm not sure we want
>> to do that, though.
>>
>>
>> Also, do Dave Chinner's ideas for indicating five I/O parameters want to be
>> 32-bit numbers? Larger? Smaller? Can they be log2?
>
> Definitely 32 bit, IMO, as it's not uncommon to see optimal IO sizes
> in the tens of megabytes on large, high bandwidth storage systems.
> As for being log2 - that's just making it more complex to use and
> making code ugly - we'd have to convert to log2 in kernel, then
> convert back in every single application....
>
>> Note also, that I've suggested that we represent the timestamp granularity
>> information as a decimal float (which requires 3 bytes per timestamp) and that
>> we provide separate granularities for each timestamp.
>>
>> David
>>
>
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-05-09 04:25:32

by Dave Chinner

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On Tue, May 08, 2012 at 09:09:41PM -0400, J. Bruce Fields wrote:
> On Wed, May 09, 2012 at 10:24:20AM +1000, Dave Chinner wrote:
> > On Tue, May 08, 2012 at 09:19:42PM +0100, David Howells wrote:
> > >
> > > Should I split the file-specific info and the fs-specific info and make the
> > > second optional? What I'm thinking of is something like this:
> > >
> > > Have a file information structure:
> > >
> > > struct statx {
> > > /* 0x00 */
> > > uint32_t st_mask; /* What results were written */
> > > uint32_t st_information; /* Information about the file */
> > > uint16_t st_mode; /* File mode */
> > > uint16_t __spare0[3];
> > > /* 0x10 */
> > > uint32_t st_uid; /* User ID of owner */
> > > uint32_t st_gid; /* Group ID of owner */
> > > uint32_t st_nlink; /* Number of hard links */
> > > uint32_t st_blksize; /* Optimal size for filesystem I/O */
> > > /* 0x20 */
> > > struct statx_dev st_rdev; /* Device ID of special file */
> > > struct statx_dev st_dev; /* ID of device containing file */
> > > /* 0x30 */
> > > int32_t st_atime_ns; /* Last access time (ns part) */
> > > int32_t st_btime_ns; /* File creation time (ns part) */
> > > int32_t st_ctime_ns; /* Last attribute change time (ns part) */
> > > int32_t st_mtime_ns; /* Last data modification time (ns part) */
> > > /* 0x40 */
> > > int64_t st_atime; /* Last access time */
> > > int64_t st_btime; /* File creation time */
> > > int64_t st_ctime; /* Last attribute change time */
> > > int64_t st_mtime; /* Last data modification time */
> > > /* 0x60 */
> > > uint64_t st_ino; /* Inode number */
> > > uint64_t st_size; /* File size */
> > > uint64_t st_blocks; /* Number of 512-byte blocks allocated */
> > > uint64_t st_gen; /* Inode generation number */
> >
> > I don't think we want to expose the inode generation numbers. It is
> > trivial to construct NFS file handles (usually just fsid, inode
> > number and generation) with that information and hence bypass
> > security checks to access files.
>
> I'm not convinced there's much value in trying to keep filehandles
> secret.

Sure, but I can't really see any good reason to expose filesystem
internal implementation details like this - a generation number is
usually used to differentiate between inode life cycles which
userspace has no concept of and is different for every filesystem,
so it's behaviour and values are not going to be consistent across
filesystems. Some filesystems might not even have a generation
number they can export, and that makes me wonder if there is any
good reason for exposing it at all.

If you need to discriminate between versions of files with the same
name, then use name_to_handle_at() and compare filehandles....

Cheers,

Dave.

--
Dave Chinner
[email protected]

2012-05-09 09:21:14

by David Howells

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

Dave Chinner <[email protected]> wrote:

> I don't think we want to expose the inode generation numbers. It is
> trivial to construct NFS file handles (usually just fsid, inode
> number and generation) with that information and hence bypass
> security checks to access files.

I was asked for it by Bernd Schubert for userspace NFS servers and FUSE -
maybe he can say what he wants it for.

I also have a note that Jeremy Allison asked for it, but I can't find where or
why, so that might be an error.

It looks like FreeBSD do have an st_gen field in their stat struct, but it's
only filled in for root. Maybe I could do something like that?

David

2012-05-09 11:14:20

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On Wed, May 09, 2012 at 02:25:32PM +1000, Dave Chinner wrote:
> On Tue, May 08, 2012 at 09:09:41PM -0400, J. Bruce Fields wrote:
> > On Wed, May 09, 2012 at 10:24:20AM +1000, Dave Chinner wrote:
> > > On Tue, May 08, 2012 at 09:19:42PM +0100, David Howells wrote:
> > > >
> > > > Should I split the file-specific info and the fs-specific info and make the
> > > > second optional? What I'm thinking of is something like this:
> > > >
> > > > Have a file information structure:
> > > >
> > > > struct statx {
> > > > /* 0x00 */
> > > > uint32_t st_mask; /* What results were written */
> > > > uint32_t st_information; /* Information about the file */
> > > > uint16_t st_mode; /* File mode */
> > > > uint16_t __spare0[3];
> > > > /* 0x10 */
> > > > uint32_t st_uid; /* User ID of owner */
> > > > uint32_t st_gid; /* Group ID of owner */
> > > > uint32_t st_nlink; /* Number of hard links */
> > > > uint32_t st_blksize; /* Optimal size for filesystem I/O */
> > > > /* 0x20 */
> > > > struct statx_dev st_rdev; /* Device ID of special file */
> > > > struct statx_dev st_dev; /* ID of device containing file */
> > > > /* 0x30 */
> > > > int32_t st_atime_ns; /* Last access time (ns part) */
> > > > int32_t st_btime_ns; /* File creation time (ns part) */
> > > > int32_t st_ctime_ns; /* Last attribute change time (ns part) */
> > > > int32_t st_mtime_ns; /* Last data modification time (ns part) */
> > > > /* 0x40 */
> > > > int64_t st_atime; /* Last access time */
> > > > int64_t st_btime; /* File creation time */
> > > > int64_t st_ctime; /* Last attribute change time */
> > > > int64_t st_mtime; /* Last data modification time */
> > > > /* 0x60 */
> > > > uint64_t st_ino; /* Inode number */
> > > > uint64_t st_size; /* File size */
> > > > uint64_t st_blocks; /* Number of 512-byte blocks allocated */
> > > > uint64_t st_gen; /* Inode generation number */
> > >
> > > I don't think we want to expose the inode generation numbers. It is
> > > trivial to construct NFS file handles (usually just fsid, inode
> > > number and generation) with that information and hence bypass
> > > security checks to access files.
> >
> > I'm not convinced there's much value in trying to keep filehandles
> > secret.
>
> Sure, but I can't really see any good reason to expose filesystem
> internal implementation details like this - a generation number is
> usually used to differentiate between inode life cycles which
> userspace has no concept of and is different for every filesystem,
> so it's behaviour and values are not going to be consistent across
> filesystems.

That's OK. The only requirement would be that the (inode number, inode
generation) pair be different for different inodes on the same
filesystem.

> Some filesystems might not even have a generation
> number they can export, and that makes me wonder if there is any
> good reason for exposing it at all.

That's true of a number of these new attributes.

> If you need to discriminate between versions of files with the same
> name, then use name_to_handle_at() and compare filehandles....

Sure.

Since the only use case given for this has been constructing
filehandles, and since we already have an interface for that, I don't
feel particularly strongly about this.

--b.

2012-05-09 11:19:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On Wed, May 09, 2012 at 10:21:14AM +0100, David Howells wrote:
> Dave Chinner <[email protected]> wrote:
>
> > I don't think we want to expose the inode generation numbers. It is
> > trivial to construct NFS file handles (usually just fsid, inode
> > number and generation) with that information and hence bypass
> > security checks to access files.
>
> I was asked for it by Bernd Schubert for userspace NFS servers and FUSE -
> maybe he can say what he wants it for.

It's entirely broken, as a generation number might be part of the file
handle (and for Linux-like filesystems normally is), but it's entirely
up to the filesystem to decide how it works. That's why we added system
calls to do operations on opaque file handles that the file system
controls. Exposing a completely meaningless "generation" is a bad idea.

2012-05-09 11:55:16

by Bernd Schubert

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On 05/09/2012 01:19 PM, Christoph Hellwig wrote:
> On Wed, May 09, 2012 at 10:21:14AM +0100, David Howells wrote:
>> Dave Chinner<[email protected]> wrote:
>>
>>> I don't think we want to expose the inode generation numbers. It is
>>> trivial to construct NFS file handles (usually just fsid, inode
>>> number and generation) with that information and hence bypass
>>> security checks to access files.
>>
>> I was asked for it by Bernd Schubert for userspace NFS servers and FUSE -
>> maybe he can say what he wants it for.
>
> It's entirely broken, as a generation number might be part of the file
> handle (and for Linux-like filesystems normally is), but it's entirely
> up to the filesystem to decide how it works. That's why we added system
> calls to do operations on opaque file handles that the file system
> controls. Exposing a completely meaningless "generation" is a bad idea.
>

The basic idea of generation numbers is to check if an inode was
recycled, so only if the tuple of inode-number and generation-number
matches we still have the same file. Kernel nfs uses that and unfs3 uses
it via EXT2_IOC_GETVERSION, which has the overhead of an additional
syscall. Unionfs-fuse usually keeps file open, however, it might run out
of the maximum allowed files and I plan to add a mode to close and
re-open files as failback mode. For that the definite knowledge if a
file/inode is still the very same and the inode was not just recycled is
crucial.

All of that being said, I think with open_by_handle_at() syscall we
don't need the inode generation number any more.



Cheers,
Bernd

2012-05-09 12:05:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On Wed, May 09, 2012 at 01:55:16PM +0200, Bernd Schubert wrote:
> The basic idea of generation numbers is to check if an inode was
> recycled, so only if the tuple of inode-number and generation-number
> matches we still have the same file. Kernel nfs

NFS does not and should not look at the inode generation. Except for a
bit of legacy code for the old pre-Linux 2.4 filehandles it looks at the
opaque file handle returned and only interpreted by the filesystem. Any
userspace NFS server should do the same.

2012-05-09 12:25:20

by Bernd Schubert

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On 05/09/2012 02:05 PM, Christoph Hellwig wrote:
> On Wed, May 09, 2012 at 01:55:16PM +0200, Bernd Schubert wrote:
>> The basic idea of generation numbers is to check if an inode was
>> recycled, so only if the tuple of inode-number and generation-number
>> matches we still have the same file. Kernel nfs
>
> NFS does not and should not look at the inode generation. Except for a
> bit of legacy code for the old pre-Linux 2.4 filehandles it looks at the
> opaque file handle returned and only interpreted by the filesystem. Any
> userspace NFS server should do the same.

Ok, I didn't look how kernel NFS does it for quite some time already...
User space NFS only can do it beginning with 2.6.39 - given that user
space also needs to support older kernels and other OSs, which might not
have open_by_handle, userspace unfortunately cannot entirely rely on
that feature.

Cheers,
Bernd

2012-05-09 13:51:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On 2012-05-09, at 6:25 AM, Bernd Schubert wrote:
> On 05/09/2012 02:05 PM, Christoph Hellwig wrote:
>> On Wed, May 09, 2012 at 01:55:16PM +0200, Bernd Schubert wrote:
>>> The basic idea of generation numbers is to check if an inode was
>>> recycled, so only if the tuple of inode-number and generation-number
>>> matches we still have the same file. Kernel nfs
>>
>> NFS does not and should not look at the inode generation. Except for a
>> bit of legacy code for the old pre-Linux 2.4 filehandles it looks at the
>> opaque file handle returned and only interpreted by the filesystem. Any
>> userspace NFS server should do the same.
>
> Ok, I didn't look how kernel NFS does it for quite some time already...
> User space NFS only can do it beginning with 2.6.39 - given that user space also needs to support older kernels and other OSs, which might not have open_by_handle, userspace unfortunately cannot entirely rely on that feature.

But even fewer kernels have sys_statxat() in them (i.e. none), so you can rely on that even less than open_by_handle()...

Cheers, Andreas






2012-05-09 14:13:13

by Bernd Schubert

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On 05/09/2012 03:51 PM, Andreas Dilger wrote:
> On 2012-05-09, at 6:25 AM, Bernd Schubert wrote:
>> On 05/09/2012 02:05 PM, Christoph Hellwig wrote:
>>> On Wed, May 09, 2012 at 01:55:16PM +0200, Bernd Schubert wrote:
>>>> The basic idea of generation numbers is to check if an inode was
>>>> recycled, so only if the tuple of inode-number and generation-number
>>>> matches we still have the same file. Kernel nfs
>>>
>>> NFS does not and should not look at the inode generation. Except for a
>>> bit of legacy code for the old pre-Linux 2.4 filehandles it looks at the
>>> opaque file handle returned and only interpreted by the filesystem. Any
>>> userspace NFS server should do the same.
>>
>> Ok, I didn't look how kernel NFS does it for quite some time already...
>> User space NFS only can do it beginning with 2.6.39 - given that user space also needs to support older kernels and other OSs, which might not have open_by_handle, userspace unfortunately cannot entirely rely on that feature.
>
> But even fewer kernels have sys_statxat() in them (i.e. none), so you can rely on that even less than open_by_handle()...

Well, I didn't say that :) In summary, an application needs to try to
use the open-by-handle call and if that is not supported, it has to fall
back to traditional stat and generation-number-ioctl. And as I said
before, open-by-handle very likely removes the requirement for
generation numbers in sys_statxat().


Cheers,
Bernd


2012-05-10 09:14:11

by David Howells

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

Dave Chinner <[email protected]> wrote:

> > Also, do Dave Chinner's ideas for indicating five I/O parameters want to be
> > 32-bit numbers? Larger? Smaller? Can they be log2?
>
> Definitely 32 bit, IMO, as it's not uncommon to see optimal IO sizes
> in the tens of megabytes on large, high bandwidth storage systems.
> As for being log2 - that's just making it more complex to use and
> making code ugly - we'd have to convert to log2 in kernel, then
> convert back in every single application....

ilog2() in the kernel uses the CPU's bit-scan instruction if it has one and
converting back is just a bitshift operator.

But let's go with 32-bit fields for the moment. I presume we aren't worried
about a driver that wants to do a 4GB transfer...

David

2012-05-10 09:23:14

by David Howells

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

Andreas Dilger <[email protected]> wrote:

> The main question I have about these fields is what the difference is
> between st_blksize and st_alloc_blksize?

I would assume that st_blksize is some ideal I/O size, not the allocation
block size. Dave Chinner might be able to answer better as he suggested these
I/O information fields.

David

2012-05-10 16:05:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On 2012-05-10, at 3:23 AM, David Howells wrote:
> Andreas Dilger <[email protected]> wrote:
>> The main question I have about these fields is what the difference is
>> between st_blksize and st_alloc_blksize?
>
> I would assume that st_blksize is some ideal I/O size, not the
> allocation block size.

But there are several optimal IO sizes, so unless there is a clear
description of what each of these fields is intended to be used for,
it doesn't make sense to add them in.

> Dave Chinner might be able to answer better as he suggested these
> I/O information fields.
>
> David


Cheers, Andreas

2012-05-10 17:10:16

by Roland McGrath

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

POSIX specifies st_blksize thusly: "A file system-specific preferred I/O
block size for this object. In some file system types, this may vary from
file to file."

Since there is only one available to POSIX applications, it should map to
the one that's described as "preferred IO size for general usage".


Thanks,
Roland

2012-05-11 08:54:13

by Andreas Dilger

[permalink] [raw]
Subject: Re: Extended file stat: Splitting file- and fs-specific info?

On 2012-05-10, at 11:10 AM, Roland McGrath wrote:
> POSIX specifies st_blksize thusly: "A file system-specific preferred
> I/O block size for this object. In some file system types, this may
> vary from file to file."
>
> Since there is only one available to POSIX applications, it should map
> to the one that's described as "preferred IO size for general usage".

Sure, but statxat() isn't a POSIX API.

While I agree with the idea that there should be enough information about the underlying layout for applications to be able to submit good IO, it doesn't help if we have a bunch of extra fields that have vague meanings.

They will get filled in by the filesystem in a haphazard way, and will not be used by application developers that don't understand what they mean.

Cheers, Andreas