LinuxLists.cc - [rfc] new stat*fs-like syscall?

2010-06-24 13:15:05

Subject: [rfc] new stat*fs-like syscall?

This has come up a few times in the past, and I'd like to try to get
an agreement on it. statvfs(2) importantly contains f_flag (mount
flags), and its use is encouraged over statfs(2). The kernel
provides a statfs syscall only.

This means glibc has to provide f_flag support by parsing /proc/mounts
and stat(2)ing mount points. This is really slow, and /proc/mounts is
hard for the kernel to provide. It's actually the last scalability
bottleneck in the core vfs for dbench (samba) after my patches.

Not only that, but it's racy.
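
To make the cost concrete, here is a minimal sketch of the kind of userspace fallback described above. It is not glibc's actual code: the sketch_* names are made up for this illustration, and only the "ro" and "nosuid" options are translated.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

/* Translate the option string of one /proc/mounts entry into ST_* bits.
 * Only "ro" and "nosuid" are handled in this sketch. */
unsigned long sketch_flags_from_opts(const char *opts)
{
    struct mntent m = { .mnt_opts = (char *)opts };
    unsigned long flags = 0;

    assert(opts != NULL);
    if (hasmntopt(&m, "ro"))
        flags |= ST_RDONLY;
    if (hasmntopt(&m, "nosuid"))
        flags |= ST_NOSUID;
    return flags;
}

/* Find the mount that `path` lives on by stat()ing every mount point and
 * comparing device numbers. Returns 0 and fills *flags on success, -1 if
 * nothing matched. Every stat() here can block on a dead mount, and the
 * mount table can change while it is being scanned. */
int sketch_flags_for_path(const char *path, unsigned long *flags)
{
    struct stat target, st;
    struct mntent *m;
    FILE *fp;
    int ret = -1;

    assert(path != NULL && flags != NULL);
    if (stat(path, &target) != 0)
        return -1;
    fp = setmntent("/proc/mounts", "r");
    if (fp == NULL)
        return -1;
    while ((m = getmntent(fp)) != NULL) {
        if (stat(m->mnt_dir, &st) == 0 && st.st_dev == target.st_dev) {
            *flags = sketch_flags_from_opts(m->mnt_opts);
            ret = 0; /* keep scanning: the last matching entry wins */
        }
    }
    endmntent(fp);
    return ret;
}
```

One open(), one pass over the whole mount table and one stat() per mount point, just to recover a single flags word.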

Other than types, other differences are:
- statvfs(2) has f_frsize, which seems fairly useless.
- statvfs(2) has f_favail.
- statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
block size. The latter could be useful for disk space algorithms.
Both can be ill defined.
- statvfs(2) lacks f_type.

Is there anything more we should add here? Samba wants a capabilities
field, with things like sparse files, quotas, compression, encryption,
case preserving/sensitive.

Any thoughts?

Thanks,
Nick

2010-06-24 14:03:18

by Miklos Szeredi

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, 24 Jun 2010, Nick Piggin wrote:
> This has come up a few times in the past, and I'd like to try to get
> an agreement on it. statvfs(2) importantly contains f_flag (mount
> flags), and is encouraged to use rather than statfs(2). The kernel
> provides a statfs syscall only.
>
> This means glibc has to provide f_flag support by parsing /proc/mounts
> and stat(2)ing mount points. This is really slow, and /proc/mounts is
> hard for the kernel to provide. It's actually the last scalability
> bottleneck in the core vfs for dbench (samba) after my patches.
>
> Not only that, but it's racy.
>
> Other than types, other differences are:
> - statvfs(2) has f_frsize, which seems fairly useless.

statfs(2) also has f_frsize since 2.6.0, only it hasn't been
documented (should be fixed now).

> - statvfs(2) has f_favail.
> - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> block size. The latter could be useful for disk space algorithms.
> Both can be ill defined.

They are the same, only the documentation is different.

> - statvfs(2) lacks f_type.
>
> Is there anything more we should add here? Samba wants a capabilities
> field, with things like sparse files, quotas, compression, encryption,
> case preserving/sensitive.
>
> Any thoughts?

"struct statfs" and "struct statfs64" have spare fields. We could put
the f_flag in there including a magic "this is a valid f_flag" flag,
that distinguishes from the default zero value.
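
A minimal sketch of how userspace could consume such a field. The marker bit value and the sketch_* names are assumptions made for this illustration; nothing in this message fixes them.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Hypothetical marker bit meaning "the kernel filled in this word". */
#define SKETCH_ST_VALID 0x8000UL

/* `raw` is the word read from a formerly spare field. Returns 1 and stores
 * the mount flags in *flags if the marker is present; returns 0 if the word
 * is still the default zero from an older kernel, in which case the caller
 * has to fall back to parsing /proc/mounts. */
int sketch_decode_flags(unsigned long raw, unsigned long *flags)
{
    assert(flags != NULL);
    if (!(raw & SKETCH_ST_VALID))
        return 0;
    *flags = raw & ~SKETCH_ST_VALID;
    return 1;
}
```

The marker is needed because a read-write mount that honours suid has no flag bits set, so without it "no flags" and "old kernel" would look identical.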

Thanks,
Miklos

2010-06-24 14:15:48

by Andrew Lutomirski

Subject: Re: [rfc] new stat*fs-like syscall?

Nick Piggin wrote:
> This has come up a few times in the past, and I'd like to try to get
> an agreement on it. statvfs(2) importantly contains f_flag (mount
> flags), and is encouraged to use rather than statfs(2). The kernel
> provides a statfs syscall only.
>
> This means glibc has to provide f_flag support by parsing /proc/mounts
> and stat(2)ing mount points. This is really slow, and /proc/mounts is
> hard for the kernel to provide. It's actually the last scalability
> bottleneck in the core vfs for dbench (samba) after my patches.
>
> Not only that, but it's racy.
>
> Other than types, other differences are:
> - statvfs(2) has f_frsize, which seems fairly useless.
> - statvfs(2) has f_favail.
> - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> block size. The latter could be useful for disk space algorithms.
> Both can be ill defined.
> - statvfs(2) lacks f_type.
>
> Is there anything more we should add here? Samba wants a capabilities
> field, with things like sparse files, quotas, compression, encryption,
> case preserving/sensitive.
>
> Any thoughts?

Something like fsid but actually specified to uniquely identify a
superblock. (Currently, fsid seems to be set by the filesystem, and
nothing in particular ensures that two different filesystems couldn't
have collisions.) We could guarantee (or have a flag guaranteeing) that
(fsid, st_inode) actually uniquely identifies an inode.

Similarly, something like fsid that uniquely identifies the vfsmount
could be useful, although I don't know how easy that would be to provide
for fstat?fs.

If we could expose the complete set of filesystem mount options so that
mount(1) didn't have to look at /proc/self/mounts or /etc/mtab, then
playing with chroots would be that much easier.

Should we expose superblock and vfsmount options separately? We have
read-only bind mounts now, but the way they work is rather inscrutable,
and if stat?fs could say "superblock is read-write but vfsmount is
readonly" then people might be able to make more sense of what's going on.

--Andy

2010-06-24 14:19:18

by Miklos Szeredi

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, 24 Jun 2010, Andy Lutomirski wrote:
> Something like fsid but actually specified to uniquely identify a
> superblock. (Currently, fsid seems to be set by the filesystem, and
> nothing in particular ensures that two different filesystems couldn't
> have collisions.) We could guarantee (or have a flag guaranteeing) that
> (fsid, st_inode) actually uniquely identifies an inode.
>
> Similarly, something like fsid that uniquely identifies the vfsmount
> could be useful, although I don't know how easy that would be to provide
> for fstat?fs.
>
> If we could expose the complete set of filesystem mount options so that
> mount(1) didn't have to look at /proc/self/mounts or /etc/mtab, then
> playing with chroots would be that much easier.
>
> Should we expose superblock and vfsmount options separately? We have
> read-only bind mounts now, but the way they work is rather inscrutable,
> and if stat?fs could say "superblock is read-write but vfsmount is
> readonly" then people might be able to make more sense of what's going on.

You'll find all of those things in /proc/self/mountinfo.
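
For reference, a minimal sketch of pulling the per-mount and per-superblock options out of one mountinfo line, following the field layout documented in proc(5). The struct and function names are made up for this illustration, and the fixed 256-byte buffers are an arbitrary simplification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct sketch_mountinfo {
    int  mount_id;         /* field 1: unique ID of the mount */
    char mount_opts[256];  /* field 6: per-mount (vfsmount) options */
    char super_opts[256];  /* last field: per-superblock options */
};

/* Parse one line of /proc/self/mountinfo. Returns 0 on success, -1 on a
 * malformed line. */
int sketch_parse_mountinfo(const char *line, struct sketch_mountinfo *out)
{
    const char *sep;

    assert(line != NULL && out != NULL);
    /* mount ID, parent ID, major:minor, root, mount point, mount options */
    if (sscanf(line, "%d %*d %*s %*s %*s %255s",
               &out->mount_id, out->mount_opts) != 2)
        return -1;
    /* Optional fields end at a lone "-"; after it come the filesystem
     * type, the mount source and the superblock options. */
    sep = strstr(line, " - ");
    if (sep == NULL)
        return -1;
    if (sscanf(sep + 3, "%*s %*s %255s", out->super_opts) != 1)
        return -1;
    return 0;
}
```

For a read-only bind mount of a writable filesystem, mount_opts would start with "ro" while super_opts starts with "rw", which is the distinction asked about above.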

Thanks,
Miklos

2010-06-24 14:36:31

by Nick Piggin

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, Jun 24, 2010 at 04:03:05PM +0200, Miklos Szeredi wrote:
> On Thu, 24 Jun 2010, Nick Piggin wrote:
> > This has come up a few times in the past, and I'd like to try to get
> > an agreement on it. statvfs(2) importantly contains f_flag (mount
> > flags), and is encouraged to use rather than statfs(2). The kernel
> > provides a statfs syscall only.
> >
> > This means glibc has to provide f_flag support by parsing /proc/mounts
> > and stat(2)ing mount points. This is really slow, and /proc/mounts is
> > hard for the kernel to provide. It's actually the last scalability
> > bottleneck in the core vfs for dbench (samba) after my patches.
> >
> > Not only that, but it's racy.
> >
> > Other than types, other differences are:
> > - statvfs(2) has f_frsize, which seems fairly useless.
>
> statfs(2) also has f_frsize since 2.6.0, only it hasn't been
> documented (should be fixed now).
>
> > - statvfs(2) has f_favail.
> > - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> > block size. The latter could be useful for disk space algorithms.
> > Both can be ill defined.
>
> They are the same, only the documentation is different.
>
> > - statvfs(2) lacks f_type.
> >
> > Is there anything more we should add here? Samba wants a capabilities
> > field, with things like sparse files, quotas, compression, encryption,
> > case preserving/sensitive.
> >
> > Any thoughts?
>
> "struct statfs" and "struct statfs64" have spare fields. We could put
> the f_flag in there including a magic "this is a valid f_flag" flag,
> that distinguishes from the default zero value.

Ah so it does. We have 5 words spare. So we should have a version
number rather than just do a per-word hack each time. We could
probably pack the version number into a few bits of f_flag though.
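
A minimal sketch of that packing, assuming a 32-bit flags word with the top four bits reserved for the version. The layout and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 28..31 carry the version, the rest are flags. */
#define SKETCH_VER_SHIFT 28u
#define SKETCH_VER_MASK  (0xFu << SKETCH_VER_SHIFT)

/* Combine mount flags and an interface version into one word. */
uint32_t sketch_pack(uint32_t flags, unsigned version)
{
    assert(version <= 0xFu);
    return (flags & ~SKETCH_VER_MASK) | ((uint32_t)version << SKETCH_VER_SHIFT);
}

/* Version 0 means "old kernel, spare words untouched". */
unsigned sketch_version(uint32_t word)
{
    return (word & SKETCH_VER_MASK) >> SKETCH_VER_SHIFT;
}

uint32_t sketch_flags(uint32_t word)
{
    return word & ~SKETCH_VER_MASK;
}
```

With this layout a nonzero version doubles as the "valid" marker, and later versions can announce which of the remaining spare words are meaningful.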

2010-06-24 14:37:55

by Andrew Lutomirski

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, Jun 24, 2010 at 10:18 AM, Miklos Szeredi wrote:
> On Thu, 24 Jun 2010, Andy Lutomirski wrote:
>> Something like fsid but actually specified to uniquely identify a
>> superblock. (Currently, fsid seems to be set by the filesystem, and
>> nothing in particular ensures that two different filesystems couldn't
>> have collisions.) We could guarantee (or have a flag guaranteeing) that
>> (fsid, st_inode) actually uniquely identifies an inode.
>>
>> Similarly, something like fsid that uniquely identifies the vfsmount
>> could be useful, although I don't know how easy that would be to provide
>> for fstat?fs.
>>
>> If we could expose the complete set of filesystem mount options so that
>> mount(1) didn't have to look at /proc/self/mounts or /etc/mtab, then
>> playing with chroots would be that much easier.
>>
>> Should we expose superblock and vfsmount options separately? We have
>> read-only bind mounts now, but the way they work is rather inscrutable,
>> and if stat?fs could say "superblock is read-write but vfsmount is
>> readonly" then people might be able to make more sense of what's going on.
>
> You'll find all of those things in /proc/self/mountinfo.

Wasn't the point that /proc/self/mounts (and presumably
/proc/self/mountinfo) isn't scalable and we wanted a syscall to query
it efficiently (and racelessly)?

--Andy

2010-06-24 14:48:30

by Miklos Szeredi

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, 24 Jun 2010, Andrew Lutomirski wrote:
> Wasn't the point that /proc/self/mounts (and presumably
> /proc/self/mountinfo) isn't scalable and we wanted a syscall to query
> it efficiently (and racelessly)?

The question was how to support statvfs() efficiently, and the only
thing missing there is f_flags which can easily be added to the
existing statfs() syscall.

A separate mount_info() syscall might possibly be useful, but that's
another story.

Thanks,
Miklos

2010-06-24 23:06:49

by Andreas Dilger

Subject: Re: [rfc] new stat*fs-like syscall?

On 2010-06-24, at 08:08, Andy Lutomirski wrote:
> Something like fsid but actually specified to uniquely identify a superblock. (Currently, fsid seems to be set by the filesystem, and nothing in particular ensures that two different filesystems couldn't have collisions.) We could guarantee (or have a flag guaranteeing) that (fsid, st_inode) actually uniquely identifies an inode.

I think the right solution for this issue is to (gradually) start enforcing the "uniqueness" of the UUID in the filesystem superblock. That is what it is supposed to be for. Using (fsid, st_inode) doesn't necessarily help anything, if "fsid" isn't unique, and the same "st_inode" number is used on two different mountpoints.

To start, tracking the UUID at mount time and printing a non-fatal error at mount time if the mounted UUID is not unique would help, as would having e.g. fsck track the UUIDs of the underlying filesystems and printing a non-fatal error if it hits a duplicate UUID.

At some point in the future, the kernel can be changed to refuse to mount a filesystem with a duplicate UUID. I believe mount.xfs already does this.
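
A minimal userspace sketch of the "track, warn, but still mount" step described above. The table type, its fixed size and the function name are assumptions for illustration; this is not how any existing filesystem implements the check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_MOUNTS 64

struct sketch_uuid_table {
    unsigned char uuid[SKETCH_MAX_MOUNTS][16];
    int count;
};

/* Record the UUID of a filesystem being mounted. Returns 0 if it was new,
 * 1 if it duplicates an already mounted UUID (a warning is printed, but the
 * caller may still proceed with the mount), -1 if the table is full. */
int sketch_track_uuid(struct sketch_uuid_table *t, const unsigned char uuid[16])
{
    int i;

    assert(t != NULL && uuid != NULL);
    for (i = 0; i < t->count; i++) {
        if (memcmp(t->uuid[i], uuid, 16) == 0) {
            fprintf(stderr, "warning: filesystem UUID is already mounted\n");
            return 1;
        }
    }
    if (t->count == SKETCH_MAX_MOUNTS)
        return -1;
    memcpy(t->uuid[t->count++], uuid, 16);
    return 0;
}
```

Turning the warning into a refusal later only means treating the return value 1 as an error.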

Cheers, Andreas

2010-06-24 23:13:41

by Andreas Dilger

Subject: Re: [rfc] new stat*fs-like syscall?

On 2010-06-24, at 07:14, Nick Piggin wrote:
> This means glibc has to provide f_flag support by parsing /proc/mounts
> and stat(2)ing mount points. This is really slow, and /proc/mounts is
> hard for the kernel to provide.

Not only that, but if a mountpoint is broken (e.g. remote NFS server) then the glibc stat of all the mountpoints can hang the statvfs() call even if there is no interest in that particular filesystem.

> It's actually the last scalability bottleneck in the core vfs for dbench (samba) after my patches.
>
> Not only that, but it's racy.
>
> Other than types, other differences are:
> - statvfs(2) has f_frsize, which seems fairly useless.

Actually, we were just lamenting the fact that f_frsize is currently broken, because Lustre wants to export the IO size as 1MB for good RPC performance, but the underlying blocksize is 4kB (ext3 blocksize). Similarly, NFS might want to export the rsize/wsize of 32kB or 64kB even if the underlying filesystem blocksize is smaller.

> - statvfs(2) has f_favail.
> - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> block size. The latter could be useful for disk space algorithms.
> Both can be ill defined.

According to POSIX, "f_bsize" is the blocksize, but unfortunately this was botched in the earlier Linux implementations so currently they are both set to the same value, and using anything other than that breaks userspace programs that get them mixed up.

> - statvfs(2) lacks f_type.
>
> Is there anything more we should add here? Samba wants a capabilities
> field, with things like sparse files, quotas, compression, encryption,
> case preserving/sensitive.

It wouldn't be a bad idea, but then you could get into issues of what exactly the above flags mean. That said, I think it is better to have broad categories of features that may be slightly ill-defined than having nothing at all.

Cheers, Andreas

2010-06-25 03:50:26

by Nick Piggin

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, Jun 24, 2010 at 04:48:20PM +0200, Miklos Szeredi wrote:
> On Thu, 24 Jun 2010, Andrew Lutomirski wrote:
> > Wasn't the point that /proc/self/mounts (and presumably
> > /proc/self/mountinfo) isn't scalable and we wanted a syscall to query
> > it efficiently (and racelessly)?
>
> The question was how to support statvfs() efficiently, and the only
> thing missing there is f_flags which can easily be added to the
> existing statfs() syscall.
>
> A separate mount_info() syscall might possibly be useful, but that's
> another story.

Native statvfs() support is my motivation, but I am thinking that if
we are going to introduce a new syscall (or version rev the statfs
syscall somehow), then we should think hard about what else we can do.

More superblock info should be possible; more detailed info, like
related mounts, will be costlier, so that may be better off as a
different syscall.

2010-06-25 04:02:06

by Nick Piggin

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, Jun 24, 2010 at 05:13:38PM -0600, Andreas Dilger wrote:
> On 2010-06-24, at 07:14, Nick Piggin wrote:
> > This means glibc has to provide f_flag support by parsing /proc/mounts
> > and stat(2)ing mount points. This is really slow, and /proc/mounts is
> > hard for the kernel to provide.
>
> Not only that, but if a mountpoint is broken (e.g. remote NFS server) then the glibc stat of all the mountpoints can hang the statvfs() call even if there is no interest in that particular filesystem.

Good point.

> > It's actually the last scalability bottleneck in the core vfs for dbench (samba) after my patches.
> >
> > Not only that, but it's racy.
> >
> > Other than types, other differences are:
> > - statvfs(2) has f_frsize, which seems fairly useless.
>
> Actually, we were just lamenting the fact that f_frsize is currently broken, because Lustre wants to export the IO size as 1MB for good RPC performance, but the underlying blocksize is 4kB (ext3 blocksize). Similarly, NFS might want to export the rsize/wsize of 32kB or 64kB even if the underlying filesystem blocksize is smaller.
>
> > - statvfs(2) has f_favail.
> > - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> > block size. The latter could be useful for disk space algorithms.
> > Both can be ill defined.
>
> According to POSIX, "f_bsize" is the blocksize, but unfortunately this was botched in the earlier Linux implementations so currently they are both set to the same value, and using anything other than that breaks userspace programs that get them mixed up.

So is "frsize" supposed to be the optimal block size, or what?
f_bsize AFAIKS should be filesystem allocation block size because
apparently some programs require it to calculate size of file on
disk.

If we can't change existing suboptimal legacy things, then let's
introduce new APIs that do the right thing. Apps that care will
eventually start using eg. a new syscall.

>
> > - statvfs(2) lacks f_type.
> >
> > Is there anything more we should add here? Samba wants a capabilities
> > field, with things like sparse files, quotas, compression, encryption,
> > case preserving/sensitive.
>
> It wouldn't be a bad idea, but then you could get into issues of what exactly the above flags mean. That said, I think it is better to have broad categories of features that may be slightly ill-defined than having nothing at all.

Yes it would be tricky. I don't want to add features that will just
be useless or go unused, but I don't want to change the syscall API
just to add f_flags, without looking at other possibilities.

Thanks,
Nick

2010-06-25 04:33:39

by Jeff Garzik

Subject: Re: [rfc] new stat*fs-like syscall?

On 06/25/2010 12:01 AM, Nick Piggin wrote:
> So is "frsize" supposed to be the optimal block size, or what?
> f_bsize AFAIKS should be filesystem allocation block size because
> apparently some programs require it to calculate size of file on
> disk.
>
> If we can't change existing suboptimal legacy things, then let's
> introduce new APIs that do the right thing. Apps that care will
> eventually start using eg. a new syscall.
>
>>
>>> - statvfs(2) lacks f_type.
>>>
>>> Is there anything more we should add here? Samba wants a capabilities
>>> field, with things like sparse files, quotas, compression, encryption,
>>> case preserving/sensitive.
>>
>> It wouldn't be a bad idea, but then you could get into issues of what exactly the above flags mean. That said, I think it is better to have broad categories of features that may be slightly ill-defined than having nothing at all.
>
> Yes it would be tricky. I don't want to add features that will just
> be useless or go unused, but I don't want to change the syscall API
> just to add f_flags, without looking at other possibilities.

It would be nice to separate capabilities and fixed parameters (block
size) from statistics which change frequently (free space).

And are capabilities really suited to a C struct, at all? That seems
more suited to a key/value type interface, a la NFSv4 attributes.
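
As a rough sketch of what a key/value shape could look like from the caller's side: the key names, the types and the lookup function are all invented for this illustration and do not correspond to any existing or proposed kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute keys; new ones can be appended without changing
 * the layout of any struct that callers were compiled against. */
enum sketch_fs_key {
    SKETCH_FS_BLOCK_SIZE,
    SKETCH_FS_SPARSE_FILES,
    SKETCH_FS_CASE_SENSITIVE,
    SKETCH_FS_QUOTAS
};

struct sketch_fs_attr {
    enum sketch_fs_key key;
    uint64_t value;
};

/* Look up one attribute in the list a filesystem reported. Returns 0 and
 * stores the value if the key is present, -1 if the filesystem did not
 * report it at all. */
int sketch_fs_getattr(const struct sketch_fs_attr *attrs, size_t n,
                      enum sketch_fs_key key, uint64_t *value)
{
    size_t i;

    assert(attrs != NULL && value != NULL);
    for (i = 0; i < n; i++) {
        if (attrs[i].key == key) {
            *value = attrs[i].value;
            return 0;
        }
    }
    return -1;
}
```

Unlike a fixed bit in a C struct, a missing key lets the caller tell "not supported" apart from "this filesystem doesn't say".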

Jeff

2010-06-25 06:38:04

by Christoph Hellwig

Subject: Re: [rfc] new stat*fs-like syscall?

On Thu, Jun 24, 2010 at 05:06:45PM -0600, Andreas Dilger wrote:
> I think the right solution for this issue is to (gradually) start enforcing the "uniqueness" of the UUID in the filesystem superblock. That is what it is supposed to be for. Using (fsid, st_inode) doesn't necessarily help anything, if "fsid" isn't unique, and the same "st_inode" number is used on two different mountpoints.
>
> To start, tracking the UUID at mount time and printing a non-fatal error at mount time if the mounted UUID is not unique would help, as would having e.g. fsck track the UUIDs of the underlying filesystems and printing a non-fatal error if it hits a duplicate UUID.
>
> At some point in the future, the kernel can be changed to refuse to mount a filesystem with a duplicate UUID. I believe mount.xfs already does this.

Tracking and exposing the uuid to be exact. Having the full uuid in a
statfs/statvfs-like system call is one first step. And yes, XFS does
check the uuid during mount. But it's actually in kernelspace, not in a
mount helper which XFS doesn't have. Take a look at xfs_uuid_mount().

2010-06-25 17:47:33

by Andreas Dilger

Subject: Re: [rfc] new stat*fs-like syscall?

On 2010-06-24, at 22:01, Nick Piggin wrote:
> On Thu, Jun 24, 2010 at 05:13:38PM -0600, Andreas Dilger wrote:
>>> Other than types, other differences are:
>>> - statvfs(2) has f_frsize, which seems fairly useless.
>>
>> Actually, we were just lamenting the fact that f_frsize is currently broken, because Lustre wants to export the IO size as 1MB for good RPC performance, but the underlying blocksize is 4kB (ext3 blocksize). Similarly, NFS might want to export the rsize/wsize of 32kB or 64kB even if the underlying filesystem blocksize is smaller.
>>
>>
>>> - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
>>> block size. The latter could be useful for disk space algorithms.
>>> Both can be ill defined.
>>
>> According to POSIX, "f_bsize" is the blocksize, but unfortunately this was

Doh, typo. "f_frsize" is the "blocksize" (i.e. the units of f_blocks), and "f_bsize" is the "optimal IO size".

The SUSv2 includes the following field definitions (not showing all of them):
> unsigned long  f_bsize   file system block size
> unsigned long  f_frsize  fundamental filesystem block size
> fsblkcnt_t     f_blocks  total number of blocks on file system
>                          in units of f_frsize

>> botched in the earlier Linux implementations so currently they are both set to the same value, and using anything other than that breaks userspace programs that get them mixed up.
>
> So is "frsize" supposed to be the optimal block size, or what?

No, "frsize" is the minimum allocation unit - it is "fragment size".

> f_bsize AFAIKS should be filesystem allocation block size because
> apparently some programs require it to calculate size of file on
> disk.

Using statvfs()/struct statvfs clearly documents that f_blocks is in units of f_frsize, but since this is a relatively new API on Linux, and statfs() used f_bsize for years to mean the same thing, some applications are broken.
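
A minimal sketch of the correct calculation with struct statvfs: block counts are scaled by f_frsize, never by f_bsize. The helper names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Total capacity in bytes: f_blocks is counted in units of f_frsize. */
uint64_t sketch_total_bytes(const struct statvfs *s)
{
    assert(s != NULL);
    return (uint64_t)s->f_blocks * s->f_frsize;
}

/* Space available to unprivileged users, in the same units. */
uint64_t sketch_avail_bytes(const struct statvfs *s)
{
    assert(s != NULL);
    return (uint64_t)s->f_bavail * s->f_frsize;
}
```

If a filesystem reported a 1MB f_bsize over a 4kB f_frsize, as in the Lustre case above, an application multiplying f_blocks by f_bsize would overstate the capacity by a factor of 256.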

> If we can't change existing suboptimal legacy things, then let's
> introduce new APIs that do the right thing. Apps that care will
> eventually start using eg. a new syscall.

I'd rather NOT start a proliferation of redundant syscalls, since there is no expectation that they will be used correctly either, and it just makes applications less portable. I think it is less effort to fix the few current applications using sys_statvfs() incorrectly to use f_frsize than to use some new linux-only syscall.

>> It wouldn't be a bad idea, but then you could get into issues of what exactly the above flags mean. That said, I think it is better to have broad categories of features that may be slightly ill-defined than having nothing at all.
>
> Yes it would be tricky. I don't want to add features that will just
> be useless or go unused, but I don't want to change the syscall API
> just to add f_flags, without looking at other possibilities.

SUSv2 only defines the flags ST_RDONLY and ST_NOSUID, and this is also what is documented in the Linux/BSD/OSX statvfs(3) man page. According to the Solaris statvfs(3) man page I found it additionally defines:

ST_NOTRUNC 0x04 /* does not truncate file names longer than
                   NAME_MAX */
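
For reference, a minimal sketch of how an application tests the two SUSv2 flags through statvfs(3). The sketch_* helper names are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/* Nonzero if the filesystem is mounted read-only. */
int sketch_is_readonly(const struct statvfs *s)
{
    assert(s != NULL);
    return (s->f_flag & ST_RDONLY) != 0;
}

/* Nonzero if set-user-ID and set-group-ID bits are ignored on this mount. */
int sketch_ignores_suid(const struct statvfs *s)
{
    assert(s != NULL);
    return (s->f_flag & ST_NOSUID) != 0;
}

/* Print both properties for the filesystem containing `path`.
 * Returns 0 on success, -1 if statvfs() fails. */
int sketch_report(const char *path)
{
    struct statvfs s;

    if (statvfs(path, &s) != 0)
        return -1;
    printf("%s: %s, %s\n", path,
           sketch_is_readonly(&s) ? "read-only" : "read-write",
           sketch_ignores_suid(&s) ? "nosuid" : "suid");
    return 0;
}
```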

Cheers, Andreas

2010-06-25 17:52:27

by Ulrich Drepper

Subject: Re: [rfc] new stat*fs-like syscall?

On Fri, Jun 25, 2010 at 10:47, Andreas Dilger wrote:
> SUSv2 only defines the flags ST_RDONLY and ST_NOSUID, and this is also what is documented in the Linux/BSD/OSX statvfs(3) man page. According to the Solaris statvfs(3) man page I found it additionally defines:
>
> ST_NOTRUNC 0x04 /* does not truncate file names longer than
>                    NAME_MAX */

glibc supports many more flags. SuS of course has to restrict itself,
there are not that many flags which are portable and available on all
the platforms. Look at /usr/include/bits/statvfs.h for what has to be
supported and the values to use. If the values the kernel will use
differ I'd have to (unnecessarily) convert the values. If some values
are missing/not supported I still would have to use /proc/mounts and
nothing is gained.
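
A minimal sketch of the conversion step that becomes necessary if the kernel's bit values differ from the ST_* values. The kernel-side values here are invented and deliberately different; the table and function names are also assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Hypothetical kernel-side bit values that do NOT match the ST_* values. */
#define SKETCH_K_RDONLY 0x100UL
#define SKETCH_K_NOSUID 0x200UL

static const struct {
    unsigned long kernel_bit;
    unsigned long st_bit;
} sketch_map[] = {
    { SKETCH_K_RDONLY, ST_RDONLY },
    { SKETCH_K_NOSUID, ST_NOSUID },
};

/* Convert kernel flag bits to ST_* bits. Returns 0 and fills *st_flags on
 * success, or -1 if a bit without a known mapping is set, in which case the
 * caller is back to parsing /proc/mounts. */
int sketch_convert_flags(unsigned long kflags, unsigned long *st_flags)
{
    unsigned long out = 0, known = 0;
    size_t i;

    assert(st_flags != NULL);
    for (i = 0; i < sizeof sketch_map / sizeof sketch_map[0]; i++) {
        if (kflags & sketch_map[i].kernel_bit)
            out |= sketch_map[i].st_bit;
        known |= sketch_map[i].kernel_bit;
    }
    if (kflags & ~known)
        return -1;
    *st_flags = out;
    return 0;
}
```

If the kernel used the same values as the header, the whole function would collapse to a plain assignment.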

2010-06-25 18:16:42

by Christoph Hellwig

Subject: Re: [rfc] new stat*fs-like syscall?

On Fri, Jun 25, 2010 at 10:52:05AM -0700, Ulrich Drepper wrote:
> there are not that many flags which are portable and available on all
> the platforms. Look at /usr/include/bits/statvfs.h for what has to be
> supported and the values to use. If the values the kernel will use
> differ I'd have to (unnecessarily) convert the values. If some values
> are missing/not supported I still would have to use /proc/mounts and
> nothing is gained.

I don't quite get what ST_WRITE is supposed to mean. All but that one
can be supported trivially.

2010-06-25 18:45:10

by Christoph Hellwig

Subject: Re: [rfc] new stat*fs-like syscall?

On Fri, Jun 25, 2010 at 02:16:38PM -0400, Christoph Hellwig wrote:
> On Fri, Jun 25, 2010 at 10:52:05AM -0700, Ulrich Drepper wrote:
> > there are not that many flags which are portable and available on all
> > the platforms. Look at /usr/include/bits/statvfs.h for what has to be
> > supported and the values to use. If the values the kernel will use
> > differ I'd have to (unnecessarily) convert the values. If some values
> > are missing/not supported I still would have to use /proc/mounts and
> > nothing is gained.
>
> I don't quite get what ST_WRITE is supposed to mean. All but that one
> can be supported trivially.

In addition ST_APPEND and ST_IMMUTABLE are rather puzzling. Do you
really want these to mean that the file we call statfs on has the
immutable/append-only bits set? That is mixing two bits of stat
information into statfs?

2010-06-25 19:40:49

by Ulrich Drepper

Subject: Re: [rfc] new stat*fs-like syscall?

On Fri, Jun 25, 2010 at 11:45, Christoph Hellwig wrote:
> I don't quite get what ST_WRITE is supposed to mean. All but that one
> can be supported trivially.

ST_WRITE comes from elsewhere. We don't use it on Linux.

> In addition ST_APPEND and ST_IMMUTABLE are rather puzzling. Do you
> really want these to mean that the file we call statfs on has the
> immutable/append-only bits set? That is mixing two bits of stat
> information into statfs?

Ignore these as well; they also have a different source.

2010-06-26 05:54:11

by J. R. Okajima

Subject: Re: [rfc] new stat*fs-like syscall?

Nick Piggin:
> Is there anything more we should add here? Samba wants a capabilities
> field, with things like sparse files, quotas, compression, encryption,
> case preserving/sensitive.

How about the max link count?
There was a post in last December.
See <http://marc.info/?l=linux-kernel&m=126008640210762&w=2> and its
thread in detail.

J. R. Okajima

----------------------------------------------------------------------
pathconf(_PC_LINK_MAX) cannot get the correct value, since the Linux
kernel doesn't provide such an interface. The current implementation in
glibc issues statfs(2) first and then returns a predefined value
(EXT2_LINK_MAX, etc.) based upon the filesystem type. But glibc doesn't
support all filesystem types, i.e. when the target filesystem is unknown
to pathconf(3), it will return LINUX_LINK_MAX (127).
For glibc, there is no way except implementing this poor method.

This patch makes statfs(2) return the correct value via struct
statfs.f_spare[0].

RFC:
- Can we use f_spare for this purpose?
- Does pathconf(_PC_LINK_MAX) distinguish a dir and a non-dir?
  If a filesystem sets a different link count limit for a dir than for a
  non-dir, should the filesystem check the type of the specified
  dentry->d_inode->i_mode and return a different value?
  This patch series doesn't distinguish them and returns a single value.
- Here I tried supporting only ext[23], nfs and tmpfs, since I can test
  them myself. I left other FSs as they are, which means that if an FS
  doesn't support _PC_LINK_MAX by modifying its s_op->statfs(), the
  default value will be returned. The default value is taken from glibc
  to keep compatibility. But it may not be important.
- Some FSs which don't support hardlinks, such as the MS-DOS based ones,
  will return LINK_MAX_UNSUPPORTED, which is defined as 1.
- Other FSs which don't check the link count in link(2), such as tmpfs,
  will return LINK_MAX_UNLIMITED, which is defined as -1. This value
  doesn't mean an error. A negative return value of pathconf(3) is
  valid.

Even if the Linux kernel returns a correct value via statfs(2) (or anything
else), users will not get the value at once, since support in libc is
necessary too.
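
From the application side, the "negative value is valid" case has to be told apart from a real error through errno, as POSIX specifies for pathconf(3). A minimal sketch, with a made-up wrapper name:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Query the maximum link count for `path`.
 * Returns 0 and stores the limit in *limit,
 *         1 if the filesystem reports no limit (-1 with errno untouched),
 *        -1 on a real error (errno is set by pathconf). */
int sketch_link_max(const char *path, long *limit)
{
    long v;

    assert(path != NULL && limit != NULL);
    errno = 0;
    v = pathconf(path, _PC_LINK_MAX);
    if (v == -1)
        return errno == 0 ? 1 : -1;
    *limit = v;
    return 0;
}
```

Whatever value the kernel ends up exporting, callers only benefit once libc maps it onto this return convention.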

J. R. Okajima (5):
  vfs, support pathconf(3) with _PC_LINK_MAX
  ext2, support pathconf(3) with _PC_LINK_MAX
  ext3, support pathconf(3) with _PC_LINK_MAX
  nfs, support pathconf(3) with _PC_LINK_MAX
  tmpfs, support pathconf(3) with _PC_LINK_MAX

 fs/compat.c               |    5 +++--
 fs/ext2/super.c           |    1 +
 fs/ext3/super.c           |    1 +
 fs/libfs.c                |    1 +
 fs/nfs/client.c           |   10 +++++++---
 fs/nfs/super.c            |    1 +
 fs/open.c                 |    9 +++++++--
 include/linux/nfs_fs_sb.h |    1 +
 include/linux/statfs.h    |    6 ++++++
 mm/shmem.c                |    1 +
 10 files changed, 29 insertions(+), 7 deletions(-)

2010-06-26 09:35:50

by Christoph Hellwig

Subject: Re: [rfc] new stat*fs-like syscall?

On Sat, Jun 26, 2010 at 02:53:32PM +0900, J. R. Okajima wrote:
>
> Nick Piggin:
> > Is there anything more we should add here? Samba wants a capabilities
> > field, with things like sparse files, quotas, compression, encryption,
> > case preserving/sensitive.
>
> How about the max link count?
> There was a post in last December.
> See <http://marc.info/?l=linux-kernel&m=126008640210762&w=2> and its
> thread in detail.

That's really a job for a pathconf system call that allows querying random
parameters.

2010-06-26 10:13:42

by Andi Kleen

Subject: Re: [rfc] new stat*fs-like syscall?

Nick Piggin writes:

> Other than types, other differences are:
> - statvfs(2) has f_frsize, which seems fairly useless.
> - statvfs(2) has f_favail.
> - statfs(2) f_bsize is optimal transfer block, statvfs(2) f_bsize is fs
> block size. The latter could be useful for disk space algorithms.
> Both can be ill defined.
> - statvfs(2) lacks f_type.
>
> Is there anything more we should add here? Samba wants a capabilities
> field, with things like sparse files, quotas, compression, encryption,
> case preserving/sensitive.

I wonder if it would make sense to export the granularity
of the time stamps? We already have this information internally,
and it might allow user land to optimize its stat frequency or comparison.
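
A minimal sketch of how user land might use such a value when comparing time stamps, assuming the granularity is exported in nanoseconds and is at most one second. The function names are made up, and coarser granularities (such as 2 seconds) are not handled here.

```c
#include <assert.h>
#include <time.h>

/* Round a time stamp down to a multiple of gran_ns nanoseconds.
 * Valid for 1 <= gran_ns <= 1000000000 (one second). */
struct timespec sketch_truncate(struct timespec t, long gran_ns)
{
    assert(gran_ns >= 1 && gran_ns <= 1000000000L);
    t.tv_nsec -= t.tv_nsec % gran_ns;
    return t;
}

/* Nonzero if the two time stamps are indistinguishable on a filesystem
 * that stores them with the given granularity. */
int sketch_same_time(struct timespec a, struct timespec b, long gran_ns)
{
    a = sketch_truncate(a, gran_ns);
    b = sketch_truncate(b, gran_ns);
    return a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec;
}
```

A build tool or sync program could use this to avoid treating a file as changed only because the filesystem dropped the sub-second part of its time stamp.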

Some file systems also have quotas with "project ids". Maybe add that
too?

I think NTFS et al. also have some more time stamps, but I'm not sure
there's enough space for that.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-26 12:55:23

by J. R. Okajima

Subject: Re: [rfc] new stat*fs-like syscall?

Christoph Hellwig:
> That's really a job for a pathconf system call that allows querying random
> parameters.

Do you mean it should be implemented like this?
    vfs_pathconf(struct dentry *dentry, int parm)
    --> return dentry->d_sb->s_op->pathconf(parm)

I am afraid it is overdesigned, because the only parameter that actually
depends on the FS is _PC_LINK_MAX. All other params are already handled by
VFS, glibc or sb->statfs.
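
To make the shape of that dispatch concrete, here is a userspace mock of it. All sketch_* types are stand-ins invented for this illustration rather than the kernel's real structures, and the fallback value is the LINUX_LINK_MAX (127) default mentioned earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PC_LINK_MAX      0
#define SKETCH_DEFAULT_LINK_MAX 127L

struct sketch_super_block;

/* Mock operations table with the hypothetical pathconf hook. */
struct sketch_super_ops {
    long (*pathconf)(struct sketch_super_block *sb, int parm);
};

struct sketch_super_block {
    const struct sketch_super_ops *s_op;
};

struct sketch_dentry {
    struct sketch_super_block *d_sb;
};

/* Example filesystem hook; 32000 is used purely as an illustrative limit. */
long sketch_examplefs_pathconf(struct sketch_super_block *sb, int parm)
{
    (void)sb;
    return parm == SKETCH_PC_LINK_MAX ? 32000L : -1L;
}

/* Ask the filesystem if it provides the hook, otherwise fall back to the
 * generic default. */
long sketch_vfs_pathconf(struct sketch_dentry *dentry, int parm)
{
    struct sketch_super_block *sb;

    assert(dentry != NULL && dentry->d_sb != NULL);
    sb = dentry->d_sb;
    if (sb->s_op != NULL && sb->s_op->pathconf != NULL)
        return sb->s_op->pathconf(sb, parm);
    return parm == SKETCH_PC_LINK_MAX ? SKETCH_DEFAULT_LINK_MAX : -1L;
}
```

The mock shows why this can feel heavy: a whole new hook and dispatcher for what is, in practice, a single per-filesystem number.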

J. R. Okajima

(pathconf(3) parameters from the manual)
_PC_LINK_MAX
returns the maximum number of links to the file. If fd or path refer to a
directory, then the value applies to the whole directory. The corresponding
macro is _POSIX_LINK_MAX.

_PC_MAX_CANON
returns the maximum length of a formatted input line, where fd or path must
refer to a terminal. The corresponding macro is _POSIX_MAX_CANON.

_PC_MAX_INPUT
returns the maximum length of an input line, where fd or path must refer to a
terminal. The corresponding macro is _POSIX_MAX_INPUT.

_PC_NAME_MAX
returns the maximum length of a filename in the directory path or fd that the
process is allowed to create. The corresponding macro is _POSIX_NAME_MAX.

_PC_PATH_MAX
returns the maximum length of a relative pathname when path or fd is the
current working directory. The corresponding macro is _POSIX_PATH_MAX.

_PC_PIPE_BUF
returns the size of the pipe buffer, where fd must refer to a pipe or FIFO and
path must refer to a FIFO. The corresponding macro is _POSIX_PIPE_BUF.

_PC_CHOWN_RESTRICTED
returns non-zero if the chown(2) call may not be used on this file. If fd or
path refer to a directory, then this applies to all files in that directory.
The corresponding macro is _POSIX_CHOWN_RESTRICTED.

_PC_NO_TRUNC
returns non-zero if accessing filenames longer than _POSIX_NAME_MAX generates
an error. The corresponding macro is _POSIX_NO_TRUNC.

_PC_VDISABLE
returns non-zero if special character processing can be disabled, where fd or
path must refer to a terminal.
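
For reference, the user-space side of these selectors might be exercised
roughly as below. This is only a sketch: query_limit() and print_fs_limits()
are hypothetical helper names, not part of any proposal in this thread, and
how the C library obtains each value (statfs, a built-in constant, or
something else) is not shown.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Query one pathconf(3) limit for `path` (hypothetical helper).
 * Returns the limit, or -1 when there is no definite limit or on error;
 * *failed is set to 1 only in the error case (errno set by the call).
 */
long query_limit(const char *path, int name, int *failed)
{
    errno = 0;
    long value = pathconf(path, name);
    *failed = (value == -1 && errno != 0);
    return value;
}

/* Print the link-count and name-length limits reported for `path`. */
void print_fs_limits(const char *path)
{
    int failed;

    long links = query_limit(path, _PC_LINK_MAX, &failed);
    if (!failed)
        printf("%s: _PC_LINK_MAX = %ld\n", path, links);

    long namemax = query_limit(path, _PC_NAME_MAX, &failed);
    if (!failed)
        printf("%s: _PC_NAME_MAX = %ld\n", path, namemax);
}
```

Calling print_fs_limits("/") prints whatever the C library reports for the
root file system. A return of -1 with errno left at 0 means "no definite
limit", which is why errno is cleared before the call.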

2010-06-26 14:49:40

by Ulrich Drepper

Subject: Re: [rfc] new stat*fs-like syscall?

On Sat, Jun 26, 2010 at 02:35, Christoph Hellwig <[email protected]> wrote:
> That's really a job for a pathconf system call that allows querying random
> parameters.

Linus has always objected to sysconf/pathconf-like syscalls. If you
get it in I'm all for it.

2010-07-05 21:41:36

by Brad Boyer

Subject: Re: [rfc] new stat*fs-like syscall?

On Sat, Jun 26, 2010 at 09:54:44PM +0900, J. R. Okajima wrote:
> Christoph Hellwig:
> > That's really a job for a pathconf system call that allows querying random
> > parameters.
>
> Do you mean it should be implemented something like this?
> vfs_pathconf(struct dentry, int parm)
> --> return d_sb->s_op->pathconf(parm)

I would suggest making it an inode operation if we do actually add it. Most
cases are going to be per super-block, but it might be easier to transparently
handle things like _PC_PIPE_BUF in glibc if it could call an fpathconf type
system call on the pipe fd. I haven't looked at the current glibc code for
that particular selector. The only one I looked at in any detail was
_PC_LINK_MAX, which is the one you already discussed and is obviously a
per-sb option. The only drawback I can see is that making it an inode
operation would make the vfs_pathconf fail on a negative dentry, but that
seems like a very strange thing to support in any case.

> I am afraid that is overdesign, because the only parameter that actually
> depends on the FS is _PC_LINK_MAX. All other params are already handled by
> VFS, glibc or sb->statfs.

Brad Boyer
[email protected]

2010-07-05 23:32:26

by J. R. Okajima

Subject: Re: [rfc] new stat*fs-like syscall?

Brad Boyer:
> I would suggest making it an inode operation if we do actually add it. Most
> cases are going to be per super-block, but it might be easier to transparently
> handle things like _PC_PIPE_BUF in glibc if it could call an fpathconf type
> system call on the pipe fd. I haven't looked at the current glibc code for
> that particular selector. The only one I looked at in any detail was
> _PC_LINK_MAX, which is the one you already discussed and is obviously a
> per-sb option. The only drawback I can see is that making it an inode
> operation would make the vfs_pathconf fail on a negative dentry, but that
> seems like a very strange thing to support in any case.

The size of the pipe buffer recently became customizable, didn't it?
For _PC_PIPE_BUF, fpathconf should issue fcntl(F_GETPIPE_SZ).

A negative dentry should be supported unless some standard/specification
explicitly prohibits it. So I still think statfs is the best place to
implement _PC_LINK_MAX.

J. R. Okajima
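
A rough user-space sketch of the two queries mentioned above, for comparison.
Note that they report different quantities: fpathconf(fd, _PC_PIPE_BUF) gives
the atomic-write limit (PIPE_BUF), while fcntl(fd, F_GETPIPE_SZ) gives the
current capacity of the pipe. The helper names are hypothetical, and the
sketch assumes a Linux libc that exposes F_GETPIPE_SZ under _GNU_SOURCE; it
falls back to -1 where the macro is not visible.

```c
#define _GNU_SOURCE            /* F_GETPIPE_SZ is a Linux extension */
#include <fcntl.h>
#include <unistd.h>

/* Atomic-write limit of the pipe, as reported by fpathconf(3). */
long pipe_atomic_limit(int fd)
{
    return fpathconf(fd, _PC_PIPE_BUF);
}

/*
 * Current capacity of the pipe in bytes, or -1 if the query fails or
 * F_GETPIPE_SZ is not available in these headers.
 */
long pipe_capacity(int fd)
{
#ifdef F_GETPIPE_SZ
    return fcntl(fd, F_GETPIPE_SZ);
#else
    (void)fd;
    return -1;
#endif
}
```

With a descriptor pair from pipe(2), pipe_capacity() normally returns a value
at least as large as pipe_atomic_limit(), and only the former changes after
fcntl(F_SETPIPE_SZ).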

2010-07-06 00:46:26

by Brad Boyer

Subject: Re: [rfc] new stat*fs-like syscall?

On Tue, Jul 06, 2010 at 08:31:30AM +0900, J. R. Okajima wrote:
> The size of the pipe buffer recently became customizable, didn't it?
> For _PC_PIPE_BUF, fpathconf should issue fcntl(F_GETPIPE_SZ).

That should work and is in line with my understanding of the current
code for pathconf in glibc.

> A negative dentry should be supported unless some standard/specification
> explicitly prohibits it. So I still think statfs is the best place to
> implement _PC_LINK_MAX.

If we're going to be changing statfs (or adding a new system call)
anyway, that does seem like a reasonable place to export this data
along with whatever else gets added. With the various things that
have been suggested, maybe we need something more like the stat
replacement that has been getting discussed with the room for some
larger optional fields and a way to request a specific set of fields.

Brad Boyer
[email protected]

2010-07-06 16:45:32

by Linus Torvalds

Subject: Re: [rfc] new stat*fs-like syscall?

On Mon, Jul 5, 2010 at 5:45 PM, Brad Boyer <[email protected]> wrote:
> On Tue, Jul 06, 2010 at 08:31:30AM +0900, J. R. Okajima wrote:
>> A negative dentry should be supported unless some standard/specification
>> explicitly prohibits it. So I still think statfs is the best place to
>> implement _PC_LINK_MAX.
>
> If we're going to be changing statfs (or adding a new system call)
> anyway, that does seem like a reasonable place to export this data
> along with whatever else gets added. With the various things that
> have been suggested, maybe we need something more like the stat
> replacement that has been getting discussed with the room for some
> larger optional fields and a way to request a specific set of fields.

Let's not overdesign things. Just do something like the attached
patch, which is the obvious and straightforward thing to do.

Overdesigning is a disease. It's fundamentally wrong.

(Yeah, yeah, the patch is untested, and doesn't actually _fill_ the
new f_flags value, but that's left as a trivial exercise for the
reader.)

Linus

Attachments:

diff (5.08 kB)

2010-07-07 01:44:50

by Christoph Hellwig

Subject: Re: [rfc] new stat*fs-like syscall?

On Tue, Jul 06, 2010 at 09:45:26AM -0700, Linus Torvalds wrote:
> Let's not overdesign things. Just do something like the attached
> patch, which is the obvious and straightforward thing to do.
>
> Overdesigning is a disease. It's fundamentally wrong.
>
> (Yeah, yeah, the patch is untested, and doesn't actually _fill_ the
> new f_flags value, but that's left as a trivial exercise for the
> reader.)

At least one of the readers posted a patch filling it in already.
I still need to send out the version with the review comments addressed, but
I'm still waiting for Uli if he really insists on new syscall vectors
for the same structure. Using that one ST_VALID bit seems a lot easier
to me.
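
To illustrate the ST_VALID idea from the user-space side: the C library could
call plain statfs(2) and trust f_flags only when the kernel has set a
"valid" bit, falling back to the old /proc/mounts parsing otherwise. The
sketch below makes several assumptions that are not stated in this thread:
the bit value 0x0020 (taken from later kernel headers, not exported to user
space, hence the local SKETCH_ST_VALID name), a libc whose struct statfs has
an f_flags member, and the hypothetical helper name mount_flags().

```c
#include <sys/vfs.h>           /* statfs(2), struct statfs */

/* Assumed value of the kernel's ST_VALID bit; not a user-space macro. */
#define SKETCH_ST_VALID 0x0020UL

/*
 * Fetch mount flags for `path` through statfs(2).
 * Returns  1 and stores the flags (valid bit cleared) in *flags when the
 *            kernel marked f_flags as valid,
 *          0 when f_flags is not marked valid (caller needs a fallback),
 *         -1 when statfs() itself fails.
 */
int mount_flags(const char *path, unsigned long *flags)
{
    struct statfs buf;

    if (statfs(path, &buf) != 0)
        return -1;

    unsigned long raw = (unsigned long)buf.f_flags;
    if ((raw & SKETCH_ST_VALID) == 0)
        return 0;

    *flags = raw & ~SKETCH_ST_VALID;
    return 1;
}
```

The point of the single bit is that the existing syscall and structure stay
unchanged; an old kernel simply leaves the field zero and the caller takes
the slow path.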

2010-07-07 02:29:18

by Linus Torvalds

Subject: Re: [rfc] new stat*fs-like syscall?

On Tue, Jul 6, 2010 at 6:44 PM, Christoph Hellwig <[email protected]> wrote:
>
> I'm still waiting for Uli if he really insists on new syscall vectors
> for the same structure. Using that one ST_VALID bit seems a lot easier
> to me.

Umm. Uli doesn't get to choose kernel system call conventions. It
matters not one whit whether he insists on new system calls or not,
it's not going to happen.

Linus