LinuxLists.cc - [GIT PULL] Filesystem Information

[permalink] [raw]

Subject: Re: [GIT PULL] Filesystem Information

On Mon, Aug 3, 2020 at 5:50 PM David Howells <[email protected]> wrote:
>
>
> Hi Linus,
>
> Here's a set of patches that adds a system call, fsinfo(), that allows
> information about the VFS, mount topology, superblock and files to be
> retrieved.
>
> The patchset is based on top of the mount notifications patchset so that
> the mount notification mechanism can be hooked to provide event counters
> that can be retrieved with fsinfo(), thereby making it a lot faster to work
> out which mounts have changed.
>
> Note that there was a last minute change requested by Miklós: the event
> counter bits got moved from the mount notification patchset to this one.
> The counters got made atomic_long_t inside the kernel and __u64 in the
> UAPI. The aggregate changes can be assessed by comparing pre-change tag,
> fsinfo-core-20200724 to the requested pull tag.
>
> Karel Zak has created preliminary patches that add support to libmount[*]
> and Ian Kent has started working on making systemd use these and mount
> notifications[**].

So why are you asking to pull at this stage?

Has anyone done a review of the patchset?

I think it's obvious that this API needs more work. The integration
work done by Ian is a good direction, but it's not quite the full
validation and review that a complex new API needs.

At least that's my opinion.

Thanks,
Miklos

2020-08-04 02:16:30

[permalink] [raw]

Subject: Re: [GIT PULL] Filesystem Information

On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> On Mon, Aug 3, 2020 at 5:50 PM David Howells <[email protected]>
> wrote:
> >
> > Hi Linus,
> >
> > Here's a set of patches that adds a system call, fsinfo(), that
> > allows
> > information about the VFS, mount topology, superblock and files to
> > be
> > retrieved.
> >
> > The patchset is based on top of the mount notifications patchset so
> > that
> > the mount notification mechanism can be hooked to provide event
> > counters
> > that can be retrieved with fsinfo(), thereby making it a lot faster
> > to work
> > out which mounts have changed.
> >
> > Note that there was a last minute change requested by Miklós: the
> > event
> > counter bits got moved from the mount notification patchset to this
> > one.
> > The counters got made atomic_long_t inside the kernel and __u64 in
> > the
> > UAPI. The aggregate changes can be assessed by comparing pre-
> > change tag,
> > fsinfo-core-20200724 to the requested pull tag.
> >
> > Karel Zak has created preliminary patches that add support to
> > libmount[*]
> > and Ian Kent has started working on making systemd use these and
> > mount
> > notifications[**].
>
> So why are you asking to pull at this stage?
>
> Has anyone done a review of the patchset?

I have been working with the patch set as it has evolved for quite a
while now.

I've been reading the kernel code quite a bit and forwarded questions
and minor changes to David as they arose.

As for a review, not specifically, but while the series implements a
rather large change it's surprisingly straight forward to read.

In the time I have been working with it I haven't noticed any problems
except for those few minor things that I reported to David early on (in
some cases accompanied by simple patches).

And more recently (obviously) I've been working with the mount
notifications changes and, from a readability POV, I find it's the
same as the fsinfo() code.

>
> I think it's obvious that this API needs more work. The integration
> work done by Ian is a good direction, but it's not quite the full
> validation and review that a complex new API needs.

Maybe but the system call is fundamental to making notifications useful
and, as I say, after working with it for quite a while I don't fell
there's missing features (that David hasn't added along the way) and
have found it provides what's needed for what I'm doing (for mount
notifications at least).

I'll be posting a github PR for systemd for discussion soon while I
get on with completing the systemd change. Like overflow handling and
meson build system changes to allow building with and without the
util-linux libmount changes.

So, ideally, I'd like to see the series merged, we've been working on
it for quite a considerable time now.

Ian

2020-08-04 14:40:38

[permalink] [raw]

Subject: Re: [GIT PULL] Filesystem Information

On Tue, Aug 4, 2020 at 4:15 AM Ian Kent <[email protected]> wrote:
>
> On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> > On Mon, Aug 3, 2020 at 5:50 PM David Howells <[email protected]>
> > wrote:
> > >
> > > Hi Linus,
> > >
> > > Here's a set of patches that adds a system call, fsinfo(), that
> > > allows
> > > information about the VFS, mount topology, superblock and files to
> > > be
> > > retrieved.
> > >
> > > The patchset is based on top of the mount notifications patchset so
> > > that
> > > the mount notification mechanism can be hooked to provide event
> > > counters
> > > that can be retrieved with fsinfo(), thereby making it a lot faster
> > > to work
> > > out which mounts have changed.
> > >
> > > Note that there was a last minute change requested by Miklós: the
> > > event
> > > counter bits got moved from the mount notification patchset to this
> > > one.
> > > The counters got made atomic_long_t inside the kernel and __u64 in
> > > the
> > > UAPI. The aggregate changes can be assessed by comparing pre-
> > > change tag,
> > > fsinfo-core-20200724 to the requested pull tag.
> > >
> > > Karel Zak has created preliminary patches that add support to
> > > libmount[*]
> > > and Ian Kent has started working on making systemd use these and
> > > mount
> > > notifications[**].
> >
> > So why are you asking to pull at this stage?
> >
> > Has anyone done a review of the patchset?
>
> I have been working with the patch set as it has evolved for quite a
> while now.
>
> I've been reading the kernel code quite a bit and forwarded questions
> and minor changes to David as they arose.
>
> As for a review, not specifically, but while the series implements a
> rather large change it's surprisingly straight forward to read.
>
> In the time I have been working with it I haven't noticed any problems
> except for those few minor things that I reported to David early on (in
> some cases accompanied by simple patches).
>
> And more recently (obviously) I've been working with the mount
> notifications changes and, from a readability POV, I find it's the
> same as the fsinfo() code.
>
> >
> > I think it's obvious that this API needs more work. The integration
> > work done by Ian is a good direction, but it's not quite the full
> > validation and review that a complex new API needs.
>
> Maybe but the system call is fundamental to making notifications useful
> and, as I say, after working with it for quite a while I don't fell
> there's missing features (that David hasn't added along the way) and
> have found it provides what's needed for what I'm doing (for mount
> notifications at least).

Apart from the various issues related to the various mount ID's and
their sizes, my general comment is (and was always): why are we adding
a multiplexer for retrieval of mostly unrelated binary structures?

<linux/fsinfo.h> is 345 lines. This is not a simple and clean API.

A simple and clean replacement API would be:

int get_mount_attribute(int dfd, const char *path, const char
*attr_name, char *value_buf, size_t buf_size, int flags);

No header file needed with dubiously sized binary values.

The only argument was performance, but apart from purely synthetic
microbenchmarks that hasn't been proven to be an issue.

And notice how similar the above interface is to getxattr(), or the
proposed readfile(). Where has the "everything is a file" philosophy
gone?

I think we already lost that with the xattr API, that should have been
done in a way that fits this philosophy. But given that we have "/"
as the only special purpose char in filenames, and even repetitions
are allowed, it's hard to think of a good way to do that. Pity.

Still I think it would be nice to have a general purpose attribute
retrieval API instead of the multiplicity of binary ioctls, xattrs,
etc.

Is that totally crazy? Nobody missing the beauty in recently introduced APIs?

Thanks,
Miklos

2020-08-05 01:34:51

[permalink] [raw]

Subject: Re: [GIT PULL] Filesystem Information

On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> On Tue, Aug 4, 2020 at 4:15 AM Ian Kent <[email protected]> wrote:
> > On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> > > On Mon, Aug 3, 2020 at 5:50 PM David Howells <[email protected]
> > > >
> > > wrote:
> > > > Hi Linus,
> > > >
> > > > Here's a set of patches that adds a system call, fsinfo(), that
> > > > allows
> > > > information about the VFS, mount topology, superblock and files
> > > > to
> > > > be
> > > > retrieved.
> > > >
> > > > The patchset is based on top of the mount notifications
> > > > patchset so
> > > > that
> > > > the mount notification mechanism can be hooked to provide event
> > > > counters
> > > > that can be retrieved with fsinfo(), thereby making it a lot
> > > > faster
> > > > to work
> > > > out which mounts have changed.
> > > >
> > > > Note that there was a last minute change requested by Miklós:
> > > > the
> > > > event
> > > > counter bits got moved from the mount notification patchset to
> > > > this
> > > > one.
> > > > The counters got made atomic_long_t inside the kernel and __u64
> > > > in
> > > > the
> > > > UAPI. The aggregate changes can be assessed by comparing pre-
> > > > change tag,
> > > > fsinfo-core-20200724 to the requested pull tag.
> > > >
> > > > Karel Zak has created preliminary patches that add support to
> > > > libmount[*]
> > > > and Ian Kent has started working on making systemd use these
> > > > and
> > > > mount
> > > > notifications[**].
> > >
> > > So why are you asking to pull at this stage?
> > >
> > > Has anyone done a review of the patchset?
> >
> > I have been working with the patch set as it has evolved for quite
> > a
> > while now.
> >
> > I've been reading the kernel code quite a bit and forwarded
> > questions
> > and minor changes to David as they arose.
> >
> > As for a review, not specifically, but while the series implements
> > a
> > rather large change it's surprisingly straight forward to read.
> >
> > In the time I have been working with it I haven't noticed any
> > problems
> > except for those few minor things that I reported to David early on
> > (in
> > some cases accompanied by simple patches).
> >
> > And more recently (obviously) I've been working with the mount
> > notifications changes and, from a readability POV, I find it's the
> > same as the fsinfo() code.
> >
> > > I think it's obvious that this API needs more work. The
> > > integration
> > > work done by Ian is a good direction, but it's not quite the full
> > > validation and review that a complex new API needs.
> >
> > Maybe but the system call is fundamental to making notifications
> > useful
> > and, as I say, after working with it for quite a while I don't fell
> > there's missing features (that David hasn't added along the way)
> > and
> > have found it provides what's needed for what I'm doing (for mount
> > notifications at least).
>
> Apart from the various issues related to the various mount ID's and
> their sizes, my general comment is (and was always): why are we
> adding
> a multiplexer for retrieval of mostly unrelated binary structures?
>
> <linux/fsinfo.h> is 345 lines. This is not a simple and clean API.
>
> A simple and clean replacement API would be:
>
> int get_mount_attribute(int dfd, const char *path, const char
> *attr_name, char *value_buf, size_t buf_size, int flags);
>
> No header file needed with dubiously sized binary values.
>
> The only argument was performance, but apart from purely synthetic
> microbenchmarks that hasn't been proven to be an issue.
>
> And notice how similar the above interface is to getxattr(), or the
> proposed readfile(). Where has the "everything is a file"
> philosophy
> gone?

Maybe, but that philosophy (in a roundabout way) is what's resulted
in some of the problems we now have. Granted it's blind application
of that philosophy rather than the philosophy itself but that is
what happens.

I get that your comments are driven by the way that philosophy should
be applied which is more of a "if it works best doing it that way then
do it that way, and that's usually a file".

In this case there is a logical division of various types of file
system information and the underlying suggestion is maybe it's time
to move away from the "everything is a file" hard and fast rule,
and get rid of some of the problems that have resulted from it.

The notifications is an example, yes, the delivery mechanism is
a "file" but the design of the queueing mechanism makes a lot of
sense for the throughput that's going to be needed as time marches
on. Then there's different sub-systems each with unique information
that needs to be deliverable some other way because delivering "all"
the information via the notification would be just plain wrong so
a multi-faceted information delivery mechanism makes the most
sense to allow specific targeted retrieval of individual items of
information.

But that also supposes your at least open to the idea that "maybe
not everything should be a file".

>
> I think we already lost that with the xattr API, that should have
> been
> done in a way that fits this philosophy. But given that we have "/"
> as the only special purpose char in filenames, and even repetitions
> are allowed, it's hard to think of a good way to do that. Pity.
>
> Still I think it would be nice to have a general purpose attribute
> retrieval API instead of the multiplicity of binary ioctls, xattrs,
> etc.
>
> Is that totally crazy? Nobody missing the beauty in recently
> introduced APIs?
>
> Thanks,
> Miklos

2020-08-05 08:05:57

[permalink] [raw]

Subject: Re: [GIT PULL] Filesystem Information

On Wed, Aug 5, 2020 at 3:33 AM Ian Kent <[email protected]> wrote:
>

> On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> > And notice how similar the above interface is to getxattr(), or the
> > proposed readfile(). Where has the "everything is a file"
> > philosophy
> > gone?
>
> Maybe, but that philosophy (in a roundabout way) is what's resulted
> in some of the problems we now have. Granted it's blind application
> of that philosophy rather than the philosophy itself but that is
> what happens.

Agree. What people don't seem to realize, even though there are
blindingly obvious examples, that binary interfaces like the proposed
fsinfo(2) syscall can also result in a multitude of problems at the
same time as solving some others.

There's no magic solution in API design, it's not balck and white.
We just need to strive for a good enough solution. The problem seems
to be that trying to discuss the merits of other approaches seems to
hit a brick wall. We just see repeated pull requests from David,
without any real discussion of the proposed alternatives.

>
> I get that your comments are driven by the way that philosophy should
> be applied which is more of a "if it works best doing it that way then
> do it that way, and that's usually a file".
>
> In this case there is a logical division of various types of file
> system information and the underlying suggestion is maybe it's time
> to move away from the "everything is a file" hard and fast rule,
> and get rid of some of the problems that have resulted from it.
>
> The notifications is an example, yes, the delivery mechanism is
> a "file" but the design of the queueing mechanism makes a lot of
> sense for the throughput that's going to be needed as time marches
> on. Then there's different sub-systems each with unique information
> that needs to be deliverable some other way because delivering "all"
> the information via the notification would be just plain wrong so
> a multi-faceted information delivery mechanism makes the most
> sense to allow specific targeted retrieval of individual items of
> information.
>
> But that also supposes your at least open to the idea that "maybe
> not everything should be a file".

Sure. I've learned pragmatism, although idealist at heart. And I'm
not saying all API's from David are shit. statx(2) is doing fine.
It's a simple binary interface that does its job well. Compare the
header files for statx and fsinfo, though, and maybe you'll see what
I'm getting at...

Thanks,
Miklos

2020-08-05 08:25:33

[permalink] [raw]

Subject: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <[email protected]> wrote:

> I think we already lost that with the xattr API, that should have been
> done in a way that fits this philosophy. But given that we have "/"
> as the only special purpose char in filenames, and even repetitions
> are allowed, it's hard to think of a good way to do that. Pity.

One way this could be solved is to allow opting into an alternative
path resolution mode.

E.g.
openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);

Yes, the implementation might be somewhat tricky, but that's another
question. Also I'm pretty sure that we should be reducing the
POSIX-ness of anything below "//" to the bare minimum. No seeking,
etc....

I think this would open up some nice possibilities beyond the fsinfo thing.

Thanks,
Miklos

2020-08-05 20:16:59

[permalink] [raw]

Subject: Re: [GIT PULL] Filesystem Information

On Wed, 2020-08-05 at 10:00 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 3:33 AM Ian Kent <[email protected]> wrote:
> > On Tue, 2020-08-04 at 16:36 +0200, Miklos Szeredi wrote:
> > > And notice how similar the above interface is to getxattr(), or
> > > the
> > > proposed readfile(). Where has the "everything is a file"
> > > philosophy
> > > gone?
> >
> > Maybe, but that philosophy (in a roundabout way) is what's resulted
> > in some of the problems we now have. Granted it's blind application
> > of that philosophy rather than the philosophy itself but that is
> > what happens.
>
> Agree. What people don't seem to realize, even though there are
> blindingly obvious examples, that binary interfaces like the proposed
> fsinfo(2) syscall can also result in a multitude of problems at the
> same time as solving some others.
>
> There's no magic solution in API design, it's not balck and white.
> We just need to strive for a good enough solution. The problem seems
> to be that trying to discuss the merits of other approaches seems to
> hit a brick wall. We just see repeated pull requests from David,
> without any real discussion of the proposed alternatives.
>
> > I get that your comments are driven by the way that philosophy
> > should
> > be applied which is more of a "if it works best doing it that way
> > then
> > do it that way, and that's usually a file".
> >
> > In this case there is a logical division of various types of file
> > system information and the underlying suggestion is maybe it's time
> > to move away from the "everything is a file" hard and fast rule,
> > and get rid of some of the problems that have resulted from it.
> >
> > The notifications is an example, yes, the delivery mechanism is
> > a "file" but the design of the queueing mechanism makes a lot of
> > sense for the throughput that's going to be needed as time marches
> > on. Then there's different sub-systems each with unique information
> > that needs to be deliverable some other way because delivering
> > "all"
> > the information via the notification would be just plain wrong so
> > a multi-faceted information delivery mechanism makes the most
> > sense to allow specific targeted retrieval of individual items of
> > information.
> >
> > But that also supposes your at least open to the idea that "maybe
> > not everything should be a file".
>
> Sure. I've learned pragmatism, although idealist at heart. And I'm
> not saying all API's from David are shit. statx(2) is doing fine.
> It's a simple binary interface that does its job well. Compare the
> header files for statx and fsinfo, though, and maybe you'll see what
> I'm getting at...

Yeah, but I'm biased so not much joy there ... ;)

Ian

2020-08-11 13:56:46

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <[email protected]> wrote:
>
> > I think we already lost that with the xattr API, that should have been
> > done in a way that fits this philosophy. But given that we have "/"
> > as the only special purpose char in filenames, and even repetitions
> > are allowed, it's hard to think of a good way to do that. Pity.
>
> One way this could be solved is to allow opting into an alternative
> path resolution mode.
>
> E.g.
> openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);

Proof of concept patch and test program below.

Opted for triple slash in the hope that just maybe we could add a global
/proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
resolution without breaking too many things. Will try that later...

Comments?

Thanks,
Miklos

cat_alt.c:
-------- >8 --------
#define _GNU_SOURCE
#include <err.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <linux/unistd.h>
#include <linux/openat2.h>

#define RESOLVE_ALT 0x20 /* Alternative path walk mode where
multiple slashes have special meaning */

int main(int argc, char *argv[])
{
struct open_how how = {
.flags = O_RDONLY,
.resolve = RESOLVE_ALT,
};
int fd, res, i;
char buf[65536], *end;
const char *path = argv[1];
int dfd = AT_FDCWD;

if (argc < 2 || argc > 4)
errx(1, "usage: %s path [dirfd] [--nofollow]", argv[0]);

for (i = 2; i < argc; i++) {
if (strcmp(argv[i], "--nofollow") == 0) {
how.flags |= O_NOFOLLOW;
} else {
dfd = strtoul(argv[i], &end, 0);
if (end == argv[i] || *end)
errx(1, "invalid dirfd: %s", argv[i]);
}
}

fd = syscall(__NR_openat2, dfd, path, &how, sizeof(how));
if (fd == -1)
err(1, "failed to open %s", argv[1]);

while (1) {
res = read(fd, buf, sizeof(buf));
if (res == -1)
err(1, "failed to read file");
if (res == 0)
break;

write(1, buf, res);
}
close(fd);
return 0;
}
-------- >8 --------

---
fs/Makefile | 2
fs/file_table.c | 70 ++++++++++++++--------
fs/fsmeta.c | 135 +++++++++++++++++++++++++++++++++++++++++++
fs/internal.h | 9 ++
fs/mount.h | 4 +
fs/namei.c | 77 +++++++++++++++++++++---
fs/namespace.c | 12 +++
fs/open.c | 2
fs/proc_namespace.c | 2
include/linux/fcntl.h | 2
include/linux/namei.h | 3
include/uapi/linux/magic.h | 1
include/uapi/linux/openat2.h | 2
13 files changed, 282 insertions(+), 39 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
- fs_types.o fs_context.o fs_parser.o fsopen.o
+ fs_types.o fs_context.o fs_parser.o fsopen.o fsmeta.o \

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -178,22 +178,9 @@ struct file *alloc_empty_file_noaccount(
return f;
}

-/**
- * alloc_file - allocate and initialize a 'struct file'
- *
- * @path: the (dentry, vfsmount) pair for the new file
- * @flags: O_... flags with which the new file will be opened
- * @fop: the 'struct file_operations' for the new file
- */
-static struct file *alloc_file(const struct path *path, int flags,
- const struct file_operations *fop)
+static void init_file(struct file *file, const struct path *path, int flags,
+ const struct file_operations *fop)
{
- struct file *file;
-
- file = alloc_empty_file(flags, current_cred());
- if (IS_ERR(file))
- return file;
-
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
@@ -209,31 +196,66 @@ static struct file *alloc_file(const str
file->f_op = fop;
if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_inc(path->dentry->d_inode);
+}
+
+/**
+ * alloc_file - allocate and initialize a 'struct file'
+ *
+ * @path: the (dentry, vfsmount) pair for the new file
+ * @flags: O_... flags with which the new file will be opened
+ * @fop: the 'struct file_operations' for the new file
+ */
+static struct file *alloc_file(const struct path *path, int flags,
+ const struct file_operations *fop)
+{
+ struct file *file;
+
+ file = alloc_empty_file(flags, current_cred());
+ if (IS_ERR(file))
+ return file;
+
+ init_file(file, path, flags, fop);
+
return file;
}

-struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
- const char *name, int flags,
- const struct file_operations *fops)
+int init_file_pseudo(struct file *file, struct inode *inode,
+ struct vfsmount *mnt, const char *name, int flags,
+ const struct file_operations *fops)
{
static const struct dentry_operations anon_ops = {
.d_dname = simple_dname
};
struct qstr this = QSTR_INIT(name, strlen(name));
struct path path;
- struct file *file;

path.dentry = d_alloc_pseudo(mnt->mnt_sb, &this);
if (!path.dentry)
- return ERR_PTR(-ENOMEM);
+ return -ENOMEM;
if (!mnt->mnt_sb->s_d_op)
d_set_d_op(path.dentry, &anon_ops);
path.mnt = mntget(mnt);
d_instantiate(path.dentry, inode);
- file = alloc_file(&path, flags, fops);
- if (IS_ERR(file)) {
- ihold(inode);
- path_put(&path);
+ init_file(file, &path, flags, fops);
+
+ return 0;
+}
+
+struct file *alloc_file_pseudo(struct inode *inode, struct vfsmount *mnt,
+ const char *name, int flags,
+ const struct file_operations *fops)
+{
+ struct file *file;
+ int err;
+
+ file = alloc_empty_file(flags, current_cred());
+ if (IS_ERR(file))
+ return file;
+
+ err = init_file_pseudo(file, inode, mnt, name, flags, fops);
+ if (err) {
+ fput(file);
+ file = ERR_PTR(err);
}
return file;
}
--- /dev/null
+++ b/fs/fsmeta.c
@@ -0,0 +1,135 @@
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/magic.h>
+#include <linux/seq_file.h>
+#include <linux/fs_struct.h>
+#include <linux/pseudo_fs.h>
+
+#include "mount.h"
+#include "internal.h"
+
+static struct vfsmount *fsmeta_mnt;
+static struct inode *fsmeta_inode;
+
+
+static struct vfsmount *fsmeta_mnt_info_get_mnt(struct seq_file *seq)
+{
+ struct proc_mounts *p = seq->private;
+
+ return &list_entry(p->cursor.mnt_list.next, struct mount, mnt_list)->mnt;
+}
+
+static void *fsmeta_mnt_info_start(struct seq_file *seq, loff_t *pos)
+{
+ mnt_namespace_lock_read();
+ return *pos == 0 ? fsmeta_mnt_info_get_mnt(seq) : NULL;
+}
+
+static void *fsmeta_mnt_info_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ ++*pos;
+ return NULL;
+}
+
+static void fsmeta_mnt_info_stop(struct seq_file *seq, void *v)
+{
+ mnt_namespace_unlock_read();
+}
+
+static int fsmeta_mnt_info_show(struct seq_file *seq, void *v)
+{
+ return show_mountinfo(seq, v);
+}
+
+static const struct seq_operations fsmeta_mnt_info_sops = {
+ .start = fsmeta_mnt_info_start,
+ .next = fsmeta_mnt_info_next,
+ .stop = fsmeta_mnt_info_stop,
+ .show = fsmeta_mnt_info_show,
+};
+
+static int fsmeta_mnt_info_release(struct inode *inode, struct file *file)
+{
+ if (file->private_data) {
+ struct seq_file *seq = file->private_data;
+ struct proc_mounts *p = seq->private;
+
+ mntput(fsmeta_mnt_info_get_mnt(seq));
+ path_put(&p->root);
+
+ return seq_release_private(inode, file);
+ }
+ return 0;
+}
+
+static const struct file_operations fsmeta_mnt_info_fops = {
+ .release = fsmeta_mnt_info_release,
+ .read = seq_read,
+ .llseek = no_llseek,
+};
+
+static int fsmeta_mnt_info_open(struct file *file, const struct path *path,
+ const struct open_flags *op)
+{
+ struct proc_mounts *p;
+ int err;
+
+ err = init_file_pseudo(file, fsmeta_inode, fsmeta_mnt, "[mnt.info]",
+ op->open_flag, &fsmeta_mnt_info_fops);
+ if (err)
+ return err;
+ /*
+ * This reference is now sunk in file->f_path.dentry->d_inode and will
+ * be released by fput()
+ */
+ ihold(fsmeta_inode);
+
+ err = seq_open_private(file, &fsmeta_mnt_info_sops, sizeof(*p));
+ if (err)
+ return err;
+
+ p = ((struct seq_file *)file->private_data)->private;
+ get_fs_root(current->fs, &p->root);
+ p->cursor.mnt_list.next = &real_mount(mntget(path->mnt))->mnt_list;
+
+ return 0;
+}
+
+int fsmeta_open(const char *meta_name, const struct path *path,
+ struct file *file, const struct open_flags *op)
+{
+ if (op->open_flag & ~(O_LARGEFILE | O_CLOEXEC | O_NOFOLLOW))
+ return -EINVAL;
+
+ if (strcmp(meta_name, "mnt/info") == 0)
+ return fsmeta_mnt_info_open(file, path, op);
+
+ pr_info("invalid fsmeta file <%s> on %pd4\n", meta_name, path->dentry);
+ return -EINVAL;
+}
+
+static int fsmeta_init_fs_context(struct fs_context *fc)
+{
+ return init_pseudo(fc, FSMETA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type fsmeta_fs_type = {
+ .name = "fsmeta",
+ .init_fs_context = fsmeta_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init fsmeta_init(void)
+{
+ fsmeta_mnt = kern_mount(&fsmeta_fs_type);
+ if (IS_ERR(fsmeta_mnt))
+ panic("fsmeta_init() kernel mount failed (%ld)\n", PTR_ERR(fsmeta_mnt));
+
+ fsmeta_inode = alloc_anon_inode(fsmeta_mnt->mnt_sb);
+ if (IS_ERR(fsmeta_inode))
+ panic("fsmeta_init() inode allocation failed (%ld)\n", PTR_ERR(fsmeta_inode));
+
+ return 0;
+}
+fs_initcall(fsmeta_init);
+
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -99,6 +99,9 @@ extern void chroot_fs_refs(const struct
*/
extern struct file *alloc_empty_file(int, const struct cred *);
extern struct file *alloc_empty_file_noaccount(int, const struct cred *);
+extern int init_file_pseudo(struct file *file, struct inode *inode,
+ struct vfsmount *mnt, const char *name, int flags,
+ const struct file_operations *fops);

/*
* super.c
@@ -185,3 +188,9 @@ int sb_init_dio_done_wq(struct super_blo
*/
int do_statx(int dfd, const char __user *filename, unsigned flags,
unsigned int mask, struct statx __user *buffer);
+
+/*
+ * fs/fsmeta.c
+ */
+int fsmeta_open(const char *meta_name, const struct path *path,
+ struct file *file, const struct open_flags *op);
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -159,3 +159,7 @@ static inline bool is_anon_ns(struct mnt
}

extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt);
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2094,6 +2094,30 @@ static inline u64 hash_name(const void *

#endif

+static int lookup_alt(const char *name, struct nameidata *nd)
+{
+ if ((nd->flags & LOOKUP_RCU) && unlazy_walk(nd) != 0)
+ return -ECHILD;
+
+ nd->last.name = name + 3;
+ nd->last_type = LAST_META;
+
+ return 0;
+}
+
+static bool is_alt(const char *name, struct nameidata *nd, int depth)
+{
+ if (!(nd->flags & LOOKUP_ALT))
+ return false;
+
+ /* no alternative lookup inside symlinks */
+ if (depth)
+ return false;
+
+ /* name[0] has already been verified to be a slash */
+ return name[1] == '/' && name[2] == '/' && name[3] != '/';
+}
+
/*
* Name resolution.
* This is the basic name resolution function, turning a pathname into
@@ -2111,8 +2135,13 @@ static int link_path_walk(const char *na
nd->flags |= LOOKUP_PARENT;
if (IS_ERR(name))
return PTR_ERR(name);
- while (*name=='/')
- name++;
+ if (*name == '/') {
+ if (!is_alt(name, nd, depth)) {
+ do {
+ name++;
+ } while (*name == '/');
+ }
+ }
if (!*name)
return 0;

@@ -2122,6 +2151,9 @@ static int link_path_walk(const char *na
u64 hash_len;
int type;

+ if (*name == '/')
+ return lookup_alt(name, nd);
+
err = may_lookup(nd);
if (err)
return err;
@@ -2163,6 +2195,13 @@ static int link_path_walk(const char *na
* If it wasn't NUL, we know it was '/'. Skip that
* slash, and continue until no more slashes.
*/
+ if (is_alt(name, nd, depth)) {
+ link = walk_component(nd, WALK_TRAILING);
+ if (unlikely(link))
+ goto LINK;
+
+ return lookup_alt(name, nd);
+ }
do {
name++;
} while (unlikely(*name == '/'));
@@ -2183,6 +2222,7 @@ static int link_path_walk(const char *na
link = walk_component(nd, WALK_MORE);
}
if (unlikely(link)) {
+LINK:
if (IS_ERR(link))
return PTR_ERR(link);
/* a symlink to follow */
@@ -2239,11 +2279,11 @@ static const char *path_init(struct name
nd->path.dentry = NULL;

/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
- if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
+ if (*s == '/' && !is_alt(s, nd, 0) && !(flags & LOOKUP_IN_ROOT)) {
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
- return s;
+ return s + 1;
}

/* Relative pathname -- get the starting-point it is relative to. */
@@ -2272,7 +2312,8 @@ static const char *path_init(struct name

dentry = f.file->f_path.dentry;

- if (*s && unlikely(!d_can_lookup(dentry))) {
+ if (*s && unlikely(!d_can_lookup(dentry)) &&
+ !is_alt(s, nd, 0)) {
fdput(f);
return ERR_PTR(-ENOTDIR);
}
@@ -2303,6 +2344,9 @@ static const char *path_init(struct name

static inline const char *lookup_last(struct nameidata *nd)
{
+ if (nd->last_type == LAST_META)
+ return ERR_PTR(-EINVAL);
+
if (nd->last_type == LAST_NORM && nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;

@@ -2331,7 +2375,7 @@ static int path_lookupat(struct nameidat

while (!(err = link_path_walk(s, nd)) &&
(s = lookup_last(nd)) != NULL)
- ;
+ nd->flags &= ~LOOKUP_ALT;
if (!err)
err = complete_walk(nd);

@@ -2410,9 +2454,15 @@ static struct filename *filename_parenta
if (unlikely(retval == -ESTALE))
retval = path_parentat(&nd, flags | LOOKUP_REVAL, parent);
if (likely(!retval)) {
- *last = nd.last;
- *type = nd.last_type;
- audit_inode(name, parent->dentry, AUDIT_INODE_PARENT);
+ if (nd.last_type == LAST_META) {
+ path_put(parent);
+ putname(name);
+ name = ERR_PTR(-EINVAL);
+ } else {
+ *last = nd.last;
+ *type = nd.last_type;
+ audit_inode(name, parent->dentry, AUDIT_INODE_PARENT);
+ }
} else {
putname(name);
name = ERR_PTR(retval);
@@ -3123,6 +3173,10 @@ static const char *open_last_lookups(str
nd->flags |= op->intent;

if (nd->last_type != LAST_NORM) {
+ if (nd->last_type == LAST_META) {
+ return ERR_PTR(fsmeta_open(nd->last.name, &nd->path,
+ file, op));
+ }
if (nd->depth)
put_link(nd);
return handle_dots(nd, nd->last_type);
@@ -3206,6 +3260,9 @@ static int do_open(struct nameidata *nd,
int acc_mode;
int error;

+ if (nd->last_type == LAST_META)
+ return 0;
+
if (!(file->f_mode & (FMODE_OPENED | FMODE_CREATED))) {
error = complete_walk(nd);
if (error)
@@ -3355,7 +3412,7 @@ static struct file *path_openat(struct n
const char *s = path_init(nd, flags);
while (!(error = link_path_walk(s, nd)) &&
(s = open_last_lookups(nd, file, op)) != NULL)
- ;
+ nd->flags &= ~LOOKUP_ALT;
if (!error)
error = do_open(nd, file, op);
terminate_walk(nd);
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -69,7 +69,7 @@ static DEFINE_IDA(mnt_group_ida);
static struct hlist_head *mount_hashtable __read_mostly;
static struct hlist_head *mountpoint_hashtable __read_mostly;
static struct kmem_cache *mnt_cache __read_mostly;
-static DECLARE_RWSEM(namespace_sem);
+DECLARE_RWSEM(namespace_sem);
static HLIST_HEAD(unmounted); /* protected by namespace_sem */
static LIST_HEAD(ex_mountpoints); /* protected by namespace_sem */

@@ -1435,6 +1435,16 @@ static inline void namespace_lock(void)
down_write(&namespace_sem);
}

+void mnt_namespace_lock_read(void)
+{
+ down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+ up_read(&namespace_sem);
+}
+
enum umount_tree_flags {
UMOUNT_SYNC = 1,
UMOUNT_PROPAGATE = 2,
--- a/fs/open.c
+++ b/fs/open.c
@@ -1098,6 +1098,8 @@ inline int build_open_flags(const struct
lookup_flags |= LOOKUP_BENEATH;
if (how->resolve & RESOLVE_IN_ROOT)
lookup_flags |= LOOKUP_IN_ROOT;
+ if (how->resolve & RESOLVE_ALT)
+ lookup_flags |= LOOKUP_ALT;

op->lookup_flags = lookup_flags;
return 0;
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -128,7 +128,7 @@ static int show_vfsmnt(struct seq_file *
return err;
}

-static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
+int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
{
struct proc_mounts *p = m->private;
struct mount *r = real_mount(mnt);
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -19,7 +19,7 @@
/* List of all valid flags for the how->resolve argument: */
#define VALID_RESOLVE_FLAGS \
(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
- RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+ RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_ALT)

/* List of all open_how "versions". */
#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -15,7 +15,7 @@ enum { MAX_NESTED_LINKS = 8 };
/*
* Type of the last component on LOOKUP_PARENT
*/
-enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
+enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_META};

/* pathwalk mode */
#define LOOKUP_FOLLOW 0x0001 /* follow links at the end */
@@ -27,6 +27,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA

#define LOOKUP_REVAL 0x0020 /* tell ->d_revalidate() to trust no cache */
#define LOOKUP_RCU 0x0040 /* RCU pathwalk mode; semi-internal */
+#define LOOKUP_ALT 0x200000 /* Alternative path walk mode */

/* These tell filesystem methods that we are dealing with the final component... */
#define LOOKUP_OPEN 0x0100 /* ... in open */
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -88,6 +88,7 @@
#define BPF_FS_MAGIC 0xcafe4a11
#define AAFS_MAGIC 0x5a3c69f0
#define ZONEFS_MAGIC 0x5a4f4653
+#define FSMETA_MAGIC 0x9f8ea387

/* Since UDF 2.01 is ISO 13346 based... */
#define UDF_SUPER_MAGIC 0x15013346
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -35,5 +35,7 @@ struct open_how {
#define RESOLVE_IN_ROOT 0x10 /* Make all jumps to "/" and ".."
be scoped inside the dirfd
(similar to chroot(2)). */
+#define RESOLVE_ALT 0x20 /* Alternative path walk mode where
+ multiple slashes have special meaning */

#endif /* _UAPI_LINUX_OPENAT2_H */

2020-08-11 14:09:41

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <[email protected]> wrote:
> >
> > > I think we already lost that with the xattr API, that should have been
> > > done in a way that fits this philosophy. But given that we have "/"
> > > as the only special purpose char in filenames, and even repetitions
> > > are allowed, it's hard to think of a good way to do that. Pity.
> >
> > One way this could be solved is to allow opting into an alternative
> > path resolution mode.
> >
> > E.g.
> > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>
> Proof of concept patch and test program below.
>
> Opted for triple slash in the hope that just maybe we could add a global
> /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> resolution without breaking too many things. Will try that later...
>
> Comments?

Hell, NO. This is unspeakably tasteless. And full of lovely corner cases wrt
symlink bodies, etc.

Consider that one NAKed. I'm seriously unhappy with the entire fsinfo thing
in general, but this one is really over the top.

2020-08-11 14:23:26

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 4:08 PM Al Viro <[email protected]> wrote:
>
> On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <[email protected]> wrote:
> > >
> > > > I think we already lost that with the xattr API, that should have been
> > > > done in a way that fits this philosophy. But given that we have "/"
> > > > as the only special purpose char in filenames, and even repetitions
> > > > are allowed, it's hard to think of a good way to do that. Pity.
> > >
> > > One way this could be solved is to allow opting into an alternative
> > > path resolution mode.
> > >
> > > E.g.
> > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> >
> > Proof of concept patch and test program below.
> >
> > Opted for triple slash in the hope that just maybe we could add a global
> > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> > resolution without breaking too many things. Will try that later...
> >
> > Comments?
>
> Hell, NO. This is unspeakably tasteless. And full of lovely corner cases wrt
> symlink bodies, etc.

It's disabled inside symlink body resolution.

Rules are simple:

- strip off trailing part after first instance of ///
- perform path lookup as normal
- resolve meta path after /// on result of normal lookup

Thanks,
Miklos

2020-08-11 14:32:26

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 04:22:19PM +0200, Miklos Szeredi wrote:
> On Tue, Aug 11, 2020 at 4:08 PM Al Viro <[email protected]> wrote:
> >
> > On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> > > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <[email protected]> wrote:
> > > >
> > > > > I think we already lost that with the xattr API, that should have been
> > > > > done in a way that fits this philosophy. But given that we have "/"
> > > > > as the only special purpose char in filenames, and even repetitions
> > > > > are allowed, it's hard to think of a good way to do that. Pity.
> > > >
> > > > One way this could be solved is to allow opting into an alternative
> > > > path resolution mode.
> > > >
> > > > E.g.
> > > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> > >
> > > Proof of concept patch and test program below.
> > >
> > > Opted for triple slash in the hope that just maybe we could add a global
> > > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> > > resolution without breaking too many things. Will try that later...
> > >
> > > Comments?
> >
> > Hell, NO. This is unspeakably tasteless. And full of lovely corner cases wrt
> > symlink bodies, etc.
>
> It's disabled inside symlink body resolution.
>
> Rules are simple:
>
> - strip off trailing part after first instance of ///
> - perform path lookup as normal
> - resolve meta path after /// on result of normal lookup

... and interpolation of relative symlink body into the pathname does change
behaviour now, *including* the cases when said symlink body does not contain
that triple-X^Hslash garbage. Wonderful...

2020-08-11 14:37:23

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 10:33:59AM -0400, Tang Jiye wrote:
> anyone knows how to post a question?

Generally the way you just have, except that you generally
put it *after* the relevant parts of the quoted text (and
removes the irrelevant ones).

2020-08-11 14:37:55

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 4:31 PM Al Viro <[email protected]> wrote:
>
> On Tue, Aug 11, 2020 at 04:22:19PM +0200, Miklos Szeredi wrote:
> > On Tue, Aug 11, 2020 at 4:08 PM Al Viro <[email protected]> wrote:
> > >
> > > On Tue, Aug 11, 2020 at 03:54:19PM +0200, Miklos Szeredi wrote:
> > > > On Wed, Aug 05, 2020 at 10:24:23AM +0200, Miklos Szeredi wrote:
> > > > > On Tue, Aug 4, 2020 at 4:36 PM Miklos Szeredi <[email protected]> wrote:
> > > > >
> > > > > > I think we already lost that with the xattr API, that should have been
> > > > > > done in a way that fits this philosophy. But given that we have "/"
> > > > > > as the only special purpose char in filenames, and even repetitions
> > > > > > are allowed, it's hard to think of a good way to do that. Pity.
> > > > >
> > > > > One way this could be solved is to allow opting into an alternative
> > > > > path resolution mode.
> > > > >
> > > > > E.g.
> > > > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> > > >
> > > > Proof of concept patch and test program below.
> > > >
> > > > Opted for triple slash in the hope that just maybe we could add a global
> > > > /proc/sys/fs/resolve_alt knob to optionally turn on alternative (non-POSIX) path
> > > > resolution without breaking too many things. Will try that later...
> > > >
> > > > Comments?
> > >
> > > Hell, NO. This is unspeakably tasteless. And full of lovely corner cases wrt
> > > symlink bodies, etc.
> >
> > It's disabled inside symlink body resolution.
> >
> > Rules are simple:
> >
> > - strip off trailing part after first instance of ///
> > - perform path lookup as normal
> > - resolve meta path after /// on result of normal lookup
>
> ... and interpolation of relative symlink body into the pathname does change
> behaviour now, *including* the cases when said symlink body does not contain
> that triple-X^Hslash garbage. Wonderful...

Can you please explain?

Thanks,
Miklos

2020-08-11 14:43:48

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 04:36:32PM +0200, Miklos Szeredi wrote:

> > > - strip off trailing part after first instance of ///
> > > - perform path lookup as normal
> > > - resolve meta path after /// on result of normal lookup
> >
> > ... and interpolation of relative symlink body into the pathname does change
> > behaviour now, *including* the cases when said symlink body does not contain
> > that triple-X^Hslash garbage. Wonderful...
>
> Can you please explain?

Currently substituting the body of a relative symlink in place of its name
results in equivalent pathname. With your patch that is not just no longer
true, it's no longer true even when the symlink body does not contain that
/// kludge - it can come in part from the symlink body and in part from the
rest of pathname. I.e. you can't even tell if substitution is an equivalent
replacement by looking at the symlink body alone.

2020-08-11 14:49:52

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 4:42 PM Al Viro <[email protected]> wrote:
>
> On Tue, Aug 11, 2020 at 04:36:32PM +0200, Miklos Szeredi wrote:
>
> > > > - strip off trailing part after first instance of ///
> > > > - perform path lookup as normal
> > > > - resolve meta path after /// on result of normal lookup
> > >
> > > ... and interpolation of relative symlink body into the pathname does change
> > > behaviour now, *including* the cases when said symlink body does not contain
> > > that triple-X^Hslash garbage. Wonderful...
> >
> > Can you please explain?
>
> Currently substituting the body of a relative symlink in place of its name
> results in equivalent pathname.

Except proc symlinks, that is.

> With your patch that is not just no longer
> true, it's no longer true even when the symlink body does not contain that
> /// kludge - it can come in part from the symlink body and in part from the
> rest of pathname. I.e. you can't even tell if substitution is an equivalent
> replacement by looking at the symlink body alone.

Yes, that's true not just for symlink bodies but any concatenation of
two path segments.

That's why it's enabled with RESOLVE_ALT. I've said that I plan to
experiment with turning this on globally, but that doesn't mean it's
necessarily a good idea. The posted patch contains nothing of that
sort.

Thanks,
Miklos

2020-08-11 15:21:39

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

[ I missed the beginning of this discussion, so maybe this was already
suggested ]

On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <[email protected]> wrote:
>
> >
> > E.g.
> > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>
> Proof of concept patch and test program below.

I don't think this works for the reasons Al says, but a slight
modification might.

IOW, if you do something more along the lines of

fd = open(""foo/bar", O_PATH);
metadatafd = openat(fd, "metadataname", O_ALT);

it might be workable.

So you couldn't do it with _one_ pathname, because that is always
fundamentally going to hit pathname lookup rules.

But if you start a new path lookup with new rules, that's fine.

This is what I think xattrs should always have done, because they are
broken garbage.

In fact, if we do it right, I think we could have "getxattr()" be 100%
equivalent to (modulo all the error handling that this doesn't do, of
course):

ssize_t getxattr(const char *path, const char *name,
void *value, size_t size)
{
int fd, attrfd;

fd = open(path, O_PATH);
attrfd = openat(fd, name, O_ALT);
close(fd);
read(attrfd, value, size);
close(attrfd);
}

and you'd still use getxattr() and friends as a shorthand (and for
POSIX compatibility), but internally in the kernel we'd have a
interface around that "xattrs are just file handles" model.

Linus

2020-08-11 15:33:46

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 5:20 PM Linus Torvalds
<[email protected]> wrote:
>
> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]
>
> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <[email protected]> wrote:
> >
> > >
> > > E.g.
> > > openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
> >
> > Proof of concept patch and test program below.
>
> I don't think this works for the reasons Al says, but a slight
> modification might.
>
> IOW, if you do something more along the lines of
>
> fd = open(""foo/bar", O_PATH);
> metadatafd = openat(fd, "metadataname", O_ALT);
>
> it might be workable.

That would have been my backup suggestion, in case the unified
namespace doesn't work out.

I wouldn't think the normal lookup rules really get in the way if we
explicitly enable alternative path lookup with a flag. The rules just
need to be documented.

What's the disadvantage of doing it with a single lookup WITH an enabling flag?

It's definitely not going to break anything, so no backward
compatibility issues whatsoever.

Thanks,
Miklos

2020-08-11 15:40:26

by Andy Lutomirski

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

> On Aug 11, 2020, at 8:20 AM, Linus Torvalds <[email protected]> wrote:
>
> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]
>
>> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <[email protected]> wrote:
>>
>>>
>>> E.g.
>>> openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>>
>> Proof of concept patch and test program below.
>
> I don't think this works for the reasons Al says, but a slight
> modification might.
>
> IOW, if you do something more along the lines of
>
> fd = open(""foo/bar", O_PATH);
> metadatafd = openat(fd, "metadataname", O_ALT);
>
> it might be workable.
>
> So you couldn't do it with _one_ pathname, because that is always
> fundamentally going to hit pathname lookup rules.
>
> But if you start a new path lookup with new rules, that's fine.
>
> This is what I think xattrs should always have done, because they are
> broken garbage.
>
> In fact, if we do it right, I think we could have "getxattr()" be 100%
> equivalent to (modulo all the error handling that this doesn't do, of
> course):
>
> ssize_t getxattr(const char *path, const char *name,
> void *value, size_t size)
> {
> int fd, attrfd;
>
> fd = open(path, O_PATH);
> attrfd = openat(fd, name, O_ALT);
> close(fd);
> read(attrfd, value, size);
> close(attrfd);
> }
>
> and you'd still use getxattr() and friends as a shorthand (and for
> POSIX compatibility), but internally in the kernel we'd have a
> interface around that "xattrs are just file handles" model.
>
>

This is a lot like a less nutty version of NTFS streams, whereas the /// idea is kind of like an extra-nutty version of NTFS streams.

I am personally not a fan of the in-band signaling implications of overloading /. For example, there is plenty of code out there that thinks that (a + “/“ + b) concatenates paths. With /// overloaded, this stops being true.

2020-08-11 16:06:38

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 08:20:24AM -0700, Linus Torvalds wrote:

> I don't think this works for the reasons Al says, but a slight
> modification might.
>
> IOW, if you do something more along the lines of
>
> fd = open(""foo/bar", O_PATH);
> metadatafd = openat(fd, "metadataname", O_ALT);
>
> it might be workable.
>
> So you couldn't do it with _one_ pathname, because that is always
> fundamentally going to hit pathname lookup rules.
>
> But if you start a new path lookup with new rules, that's fine.

Except that you suddenly see non-directory dentries get children.
And a lot of dcache-related logics needs to be changed if that
becomes possible.

I agree that xattrs are garbage, but this approach won't be
a straightforward solution. Can those suckers be passed to
...at() as starting points? Can they be bound in namespace?
Can something be bound *on* them? What do they have for inodes
and what maintains their inumbers (and st_dev, while we are at
it)? Can _they_ have secondaries like that (sensu Swift)?
Is that a flat space, or can they be directories?

Only a part of the problems is implementation-related (and those are
not trivial at all); most the fun comes from semantics of those things.
And answers to the implementation questions are seriously dependent upon
that...

2020-08-11 16:07:52

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi <[email protected]> wrote:
>
> What's the disadvantage of doing it with a single lookup WITH an enabling flag?
>
> It's definitely not going to break anything, so no backward
> compatibility issues whatsoever.

No backwards compatibility issues for existing programs, no.

But your suggestion is fundamentally ambiguous, and you most
definitely *can* hit that if people start using this in new programs.

Where does that "unified" pathname come from? It will be generated
from "base filename + metadata name" in user space, and

(a) the base filename might have double or triple slashes in it for
whatever reasons.

This is not some "made-up gotcha" thing - I see double slashes *all*
the time when we have things like Makefiles doing

srctree=../../src/

and then people do "$(srctree)/". If you haven't seen that kind of
pattern where the pathname has two (or sometimes more!) slashes in the
middle, you've led a very sheltered life.

(b) even if the new user space were to think about that, and remove
those (hah! when have you ever seen user space do that?), as Al
mentioned, the user *filesystem* might have pathnames with double
slashes as part of symlinks.

So now we'd have to make sure that when we traverse symlinks, that
O_ALT gets cleared. Which means that it's not a unified namespace
after all, because you can't make symlinks point to metadata.

Or we'd retroactively change the semantics of a symlink, and that _is_
a backwards compatibility issue. Not with old software, no, but it
changes the meaning of old symlinks!

So no, I don't think a unified namespace ends up working.

And I say that as somebody who actually loves the concept. Ask Al: I
have a few times pushed for "let's allow directory behavior on regular
files", so that you could do things like a tar-filesystem, and access
the contents of a tar-file by just doing

cat my-file.tar/inside/the/archive.c

or similar.

Al has convinced me it's a horrible idea (and there you have a
non-ambiguous marker: the slash at the end of a pathname that
otherwise looks and acts as a non-directory)

Linus

2020-08-11 16:11:23

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 9:05 AM Al Viro <[email protected]> wrote:
>
> Except that you suddenly see non-directory dentries get children.
> And a lot of dcache-related logics needs to be changed if that
> becomes possible.

Yeah, I think you'd basically need to associate a (dynamic)
mount-point to that path when you start doing O_ALT. Or something.

And it might not be reasonably implementable. I just think that as
_interface_ it's unambiguous and fairly clean, and if Miklos can
implement something like that, I think it would be maintainable.

No?

Linus

2020-08-11 16:18:44

by Casey Schaufler

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On 8/11/2020 8:39 AM, Andy Lutomirski wrote:
>
>> On Aug 11, 2020, at 8:20 AM, Linus Torvalds <[email protected]> wrote:
>>
>> [ I missed the beginning of this discussion, so maybe this was already
>> suggested ]
>>
>>> On Tue, Aug 11, 2020 at 6:54 AM Miklos Szeredi <[email protected]> wrote:
>>>
>>>> E.g.
>>>> openat(AT_FDCWD, "foo/bar//mnt/info", O_RDONLY | O_ALT);
>>> Proof of concept patch and test program below.
>> I don't think this works for the reasons Al says, but a slight
>> modification might.
>>
>> IOW, if you do something more along the lines of
>>
>> fd = open(""foo/bar", O_PATH);
>> metadatafd = openat(fd, "metadataname", O_ALT);
>>
>> it might be workable.
>>
>> So you couldn't do it with _one_ pathname, because that is always
>> fundamentally going to hit pathname lookup rules.
>>
>> But if you start a new path lookup with new rules, that's fine.
>>
>> This is what I think xattrs should always have done, because they are
>> broken garbage.
>>
>> In fact, if we do it right, I think we could have "getxattr()" be 100%
>> equivalent to (modulo all the error handling that this doesn't do, of
>> course):
>>
>> ssize_t getxattr(const char *path, const char *name,
>> void *value, size_t size)
>> {known
>> int fd, attrfd;
>>
>> fd = open(path, O_PATH);
>> attrfd = openat(fd, name, O_ALT);
>> close(fd);
>> read(attrfd, value, size);
>> close(attrfd);
>> }
>>
>> and you'd still use getxattr() and friends as a shorthand (and for
>> POSIX compatibility), but internally in the kernel we'd have a
>> interface around that "xattrs are just file handles" model.

This doesn't work so well for setxattr(), which we want to be atomic.

> This is a lot like a less nutty version of NTFS streams, whereas the /// idea is kind of like an extra-nutty version of NTFS streams.
>
> I am personally not a fan of the in-band signaling implications of overloading /. For example, there is plenty of code out there that thinks that (a + “/“ + b) concatenates paths. With /// overloaded, this stops being true.

Since a////////b has known meaning, and lots of applications
play loose with '/', its really dangerous to treat the string as
special. We only get away with '.' and '..' because their behavior
was defined before many of y'all were born.

2020-08-11 16:35:32

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 9:17 AM Casey Schaufler <[email protected]> wrote:
>
> This doesn't work so well for setxattr(), which we want to be atomic.

Well, it's not like the old interfaces could go away. But yes, doing

metadatafd = openat(fd, "metadataname", O_ALT | O_CREAT | O_EXCL)

to create a new xattr (and then write to it) would not act like
setxattr(). Even if you do it as one atomic write, a reader would see
that zero-sized xattr between the O_CREAT and the write.

Of course, we could just hide zero-sized xattrs from the legacy
interfaces and avoid things like that, but another option is to say
that only the legacy interfaces give that particular atomicity
guarantee.

> Since a////////b has known meaning, and lots of applications
> play loose with '/', its really dangerous to treat the string as
> special. We only get away with '.' and '..' because their behavior
> was defined before many of y'all were born.

Yeah, I really don't think it's a good idea to play with "//".

POSIX does allow special semantics for a pathname with "//" at the
*beginning*, but even that has been very questionable (and Linux has
never supported it).

Linus

2020-08-11 16:40:05

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 09:09:36AM -0700, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 9:05 AM Al Viro <[email protected]> wrote:
> >
> > Except that you suddenly see non-directory dentries get children.
> > And a lot of dcache-related logics needs to be changed if that
> > becomes possible.
>
> Yeah, I think you'd basically need to associate a (dynamic)
> mount-point to that path when you start doing O_ALT. Or something.

Whee... That's going to be non-workable for xattrs - fgetxattr()
needs to work after unlink(). And you'd obviously need to prevent
crossing into that sucker on normal lookups, which would add quite
a few interesting twists around the automount points.

I'm not saying it's not doable, but it won't be anywhere near
straightforward. And API semantics questions are still there...

2020-08-11 18:53:21

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds
<[email protected]> wrote:

> and then people do "$(srctree)/". If you haven't seen that kind of
> pattern where the pathname has two (or sometimes more!) slashes in the
> middle, you've led a very sheltered life.

Oh, I have. That's why I opted for triple slashes, since that should
work most of the time even in those concatenated cases. And yes, I
know, most is not always, and this might just be hiding bugs, etc...
I think the pragmatic approach would be to try this and see how many
triple slash hits a normal workload gets and if it's reasonably low,
then hopefully that together with warnings for O_ALT would be enough.

> (b) even if the new user space were to think about that, and remove
> those (hah! when have you ever seen user space do that?), as Al
> mentioned, the user *filesystem* might have pathnames with double
> slashes as part of symlinks.
>
> So now we'd have to make sure that when we traverse symlinks, that
> O_ALT gets cleared.

That's exactly what I implemented in the proof of concept patch.

> Which means that it's not a unified namespace
> after all, because you can't make symlinks point to metadata.

I don't think that's a great deal. Also I think other limitations
would make sense:

- no mounts allowed under ///
- no ./.. resolution after ///
- no hardlinks
- no special files, just regular and directory
- no seeking (regular or dir)

> cat my-file.tar/inside/the/archive.c
>
> or similar.
>
> Al has convinced me it's a horrible idea (and there you have a
> non-ambiguous marker: the slash at the end of a pathname that
> otherwise looks and acts as a non-directory)

Umm, can you remind me what's so horrible about that? Yeah, hard
linked directories are a no-no. But it doesn't have to be implemented
in a way to actually be a problem with hard links.

Thanks,
Miklos

2020-08-11 19:32:30

by Lennart Poettering

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Di, 11.08.20 20:49, Miklos Szeredi ([email protected]) wrote:

> On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds
> <[email protected]> wrote:
>
> > and then people do "$(srctree)/". If you haven't seen that kind of
> > pattern where the pathname has two (or sometimes more!) slashes in the
> > middle, you've led a very sheltered life.
>
> Oh, I have. That's why I opted for triple slashes, since that should
> work most of the time even in those concatenated cases. And yes, I
> know, most is not always, and this might just be hiding bugs, etc...
> I think the pragmatic approach would be to try this and see how many
> triple slash hits a normal workload gets and if it's reasonably low,
> then hopefully that together with warnings for O_ALT would be enough.

There's no point. Userspace relies on the current meaning of triple
slashes. It really does.

I know many places in systemd where we might end up with a triple
slash. Here's a real-life example: some code wants to access the
cgroup attribute 'cgroup.controllers' of the root cgroup. It thus
generates the right path in the fs for it, which is the concatenation of
"/sys/fs/cgroup/" (because that's where cgroupfs is mounted), of "/"
(i.e. for the root cgroup) and of "/cgroup.controllers" (as that's the
file the attribute is exposed under).

And there you go:

"/sys/fs/cgroup/" + "/" + "/cgroup.controllers" → "/sys/fs/cgroup///cgroup.controllers"

This is a real-life thing. Don't break this please.

Lennart

--
Lennart Poettering, Berlin

2020-08-11 19:40:43

by Christian Brauner

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 09:05:22AM -0700, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi <[email protected]> wrote:
> >
> > What's the disadvantage of doing it with a single lookup WITH an enabling flag?
> >
> > It's definitely not going to break anything, so no backward
> > compatibility issues whatsoever.
>
> No backwards compatibility issues for existing programs, no.
>
> But your suggestion is fundamentally ambiguous, and you most
> definitely *can* hit that if people start using this in new programs.
>
> Where does that "unified" pathname come from? It will be generated
> from "base filename + metadata name" in user space, and
>
> (a) the base filename might have double or triple slashes in it for
> whatever reasons.
>
> This is not some "made-up gotcha" thing - I see double slashes *all*
> the time when we have things like Makefiles doing
>
> srctree=../../src/
>
> and then people do "$(srctree)/". If you haven't seen that kind of
> pattern where the pathname has two (or sometimes more!) slashes in the
> middle, you've led a very sheltered life.
>
> (b) even if the new user space were to think about that, and remove
> those (hah! when have you ever seen user space do that?), as Al
> mentioned, the user *filesystem* might have pathnames with double
> slashes as part of symlinks.
>
> So now we'd have to make sure that when we traverse symlinks, that
> O_ALT gets cleared. Which means that it's not a unified namespace
> after all, because you can't make symlinks point to metadata.
>
> Or we'd retroactively change the semantics of a symlink, and that _is_
> a backwards compatibility issue. Not with old software, no, but it
> changes the meaning of old symlinks!
>
> So no, I don't think a unified namespace ends up working.
>
> And I say that as somebody who actually loves the concept. Ask Al: I
> have a few times pushed for "let's allow directory behavior on regular
> files", so that you could do things like a tar-filesystem, and access
> the contents of a tar-file by just doing
>
> cat my-file.tar/inside/the/archive.c
>
> or similar.
>
> Al has convinced me it's a horrible idea (and there you have a
> non-ambiguous marker: the slash at the end of a pathname that
> otherwise looks and acts as a non-directory)
>

Putting my kernel hat down, putting my userspace hat on.

I'm looking at this from a potential user of this interface.
I'm not a huge fan of the metadata fd approach I'd much rather have a
dedicated system call rather than opening a side-channel metadata fd
that I can read binary data from. Maybe I'm alone in this but I was
under the impression that other users including Ian, Lennart, and Karel
have said on-list in some form that they would prefer this approach.
There are even patches for systemd and libmount, I thought?

But if we want to go down a completely different route then I'd prefer
if this metadata fd with "special semantics" did not in any way alter
the meaning of regular paths. This has the potential to cause a lot of
churn for userspace. I think having to play concatenation games in
shared libraries for mount information is a bad plan in addition to all
the issues you raised here.

Christian

2020-08-11 19:55:11

by Christian Brauner

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 09:31:05PM +0200, Lennart Poettering wrote:
> On Di, 11.08.20 20:49, Miklos Szeredi ([email protected]) wrote:
>
> > On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds
> > <[email protected]> wrote:
> >
> > > and then people do "$(srctree)/". If you haven't seen that kind of
> > > pattern where the pathname has two (or sometimes more!) slashes in the
> > > middle, you've led a very sheltered life.
> >
> > Oh, I have. That's why I opted for triple slashes, since that should
> > work most of the time even in those concatenated cases. And yes, I
> > know, most is not always, and this might just be hiding bugs, etc...
> > I think the pragmatic approach would be to try this and see how many
> > triple slash hits a normal workload gets and if it's reasonably low,
> > then hopefully that together with warnings for O_ALT would be enough.
>
> There's no point. Userspace relies on the current meaning of triple
> slashes. It really does.
>
> I know many places in systemd where we might end up with a triple
> slash. Here's a real-life example: some code wants to access the
> cgroup attribute 'cgroup.controllers' of the root cgroup. It thus
> generates the right path in the fs for it, which is the concatenation of
> "/sys/fs/cgroup/" (because that's where cgroupfs is mounted), of "/"
> (i.e. for the root cgroup) and of "/cgroup.controllers" (as that's the
> file the attribute is exposed under).
>
> And there you go:
>
> "/sys/fs/cgroup/" + "/" + "/cgroup.controllers" → "/sys/fs/cgroup///cgroup.controllers"
>
> This is a real-life thing. Don't break this please.

Taken from a log from a container:

lxc f4 20200810105815.742 TRACE cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:552 - "cgroup.clone_children" was already set to "1"
lxc f4 20200810105815.742 WARN cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset///lxc.monitor.f4"
lxc f4 20200810105815.743 INFO cgfsng - cgroups/cgfsng.c:cgfsng_monitor_create:1366 - The monitor process uses "lxc.monitor.f4" as cgroup
lxc f4 20200810105815.743 DEBUG storage - storage/storage.c:get_storage_by_name:211 - Detected rootfs type "dir"
lxc f4 20200810105815.743 TRACE cgfsng - cgroups/cgfsng.c:cg_legacy_handle_cpuset_hierarchy:552 - "cgroup.clone_children" was already set to "1"
lxc f4 20200810105815.743 WARN cgfsng - cgroups/cgfsng.c:mkdir_eexist_on_last:1152 - File exists - Failed to create directory "/sys/fs/cgroup/cpuset///lxc.payload.f4"
lxc f4 20200810105815.743 INFO cgfsng - cgroups/cgfsng.c:cgfsng_payload_create:1469 - The container process uses "lxc.payload.f4" as cgroup
lxc f4 20200810105815.744 TRACE start - start.c:lxc_spawn:1731 - Spawned container directly into target cgroup via cgroup2 fd 17

Christian

2020-08-11 20:29:37

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <[email protected]> wrote:

> Since a////////b has known meaning, and lots of applications
> play loose with '/', its really dangerous to treat the string as
> special. We only get away with '.' and '..' because their behavior
> was defined before many of y'all were born.

So the founding fathers have set things in stone and now we can't
change it. Right?

Well that's how it looks... but let's think a little; we have '/' and
'\0' that can't be used in filenames. Also '.' and '..' are
prohibited names. It's not a trivial limitation, so applications are
probably not used to dumping binary data into file names. And that
means it's probably possible to find a fairly short combination that
is never used in practice (probably containing the "/." sequence).
Why couldn't we reserve such a combination now?

I have no idea how to find such it, but other than that, I see no
theoretical problem with extending the list of reserved filenames.

Thanks,
Miklos

2020-08-11 20:41:45

by Jann Horn

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 10:29 PM Miklos Szeredi <[email protected]> wrote:
> On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <[email protected]> wrote:
> > Since a////////b has known meaning, and lots of applications
> > play loose with '/', its really dangerous to treat the string as
> > special. We only get away with '.' and '..' because their behavior
> > was defined before many of y'all were born.
>
> So the founding fathers have set things in stone and now we can't
> change it. Right?
>
> Well that's how it looks... but let's think a little; we have '/' and
> '\0' that can't be used in filenames. Also '.' and '..' are
> prohibited names. It's not a trivial limitation, so applications are
> probably not used to dumping binary data into file names. And that
> means it's probably possible to find a fairly short combination that
> is never used in practice (probably containing the "/." sequence).
> Why couldn't we reserve such a combination now?

This isn't just about finding a string that "is never used in
practice". There is userspace software that performs security checks
based on the precise semantics that paths have nowadays, and those
security checks will sometimes happily let you use arbitrary binary
garbage in path components as long as there's no '\0' or '/' in there
and the name isn't "." or "..", because that's just how paths work on
Linux.

If you change the semantics of path strings, you'd have to be
confident that the new semantics fit nicely with all the path
validation routines that exist scattered across userspace, and don't
expose new interfaces through file server software and setuid binaries
and so on.

I really don't like this idea.

2020-08-11 20:59:12

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 10:37 PM Jann Horn <[email protected]> wrote:
> If you change the semantics of path strings, you'd have to be
> confident that the new semantics fit nicely with all the path
> validation routines that exist scattered across userspace, and don't
> expose new interfaces through file server software and setuid binaries
> and so on.

So that's where O_ALT comes in. If the application is consenting,
then that should prevent exploits. Or?

Thanks,
Miklos

2020-08-11 21:18:57

by Andy Lutomirski

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi <[email protected]> wrote:
>
> On Tue, Aug 11, 2020 at 10:37 PM Jann Horn <[email protected]> wrote:
> > If you change the semantics of path strings, you'd have to be
> > confident that the new semantics fit nicely with all the path
> > validation routines that exist scattered across userspace, and don't
> > expose new interfaces through file server software and setuid binaries
> > and so on.
>
> So that's where O_ALT comes in. If the application is consenting,
> then that should prevent exploits. Or?

We're going to be at risk from libraries that want to use the new
O_ALT mechanism but are invoked by old code that passes traditional
Linux paths. Each library will have to sanitize paths, and some will
screw it up.

I much prefer Linus' variant where the final part of the extended path
is passed as a separate parameter.

2020-08-11 21:21:08

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi <[email protected]> wrote:
>
> So that's where O_ALT comes in. If the application is consenting,
> then that should prevent exploits. Or?

If the application is consenting AND GETS IT RIGHT it should prevent exploits.

But that's a big deal.

Why not just do it the way I suggested? Then you don't have any of these issues.

Linus

2020-08-11 21:21:28

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 10:28:31PM +0200, Miklos Szeredi wrote:
> On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <[email protected]> wrote:
>
> > Since a////////b has known meaning, and lots of applications
> > play loose with '/', its really dangerous to treat the string as
> > special. We only get away with '.' and '..' because their behavior
> > was defined before many of y'all were born.
>
> So the founding fathers have set things in stone and now we can't
> change it. Right?

Right.

> Well that's how it looks... but let's think a little; we have '/' and
> '\0' that can't be used in filenames. Also '.' and '..' are
> prohibited names. It's not a trivial limitation, so applications are
> probably not used to dumping binary data into file names. And that
> means it's probably possible to find a fairly short combination that
> is never used in practice (probably containing the "/." sequence).

No, it is not. Miklos, get real - you will end up with obscure
pathname produced once in a while by a script fragment from hell
spewed out by crusty piece of awk buried in a piece of shit
makefile from hell (and you are lucky if it won't be an automake
output, while we are at it). Exercised only when some shipped
turd needs to be regenerated. Have you _ever_ tried to debug e.g.
gcc build problems? I have, and it's extremely unpleasant. Failures
tend to be obscure as hell, backtracking them through the makefiles
is a massive PITA and figuring out why said piece of awk produces
what it does...

I know what I would've done if the likely 5 hours of cursing everything
would have ended up with discovery that some luser had assumed that
surely, no sane software would ever generate this sequence of characters
in anything used as a pathname, and that for this reason I'm looking
forward to several more hours of playing with utterly revolting crap
to convince it to stay away from that sequence...

> Why couldn't we reserve such a combination now?
>
> I have no idea how to find such it, but other than that, I see no
> theoretical problem with extending the list of reserved filenames.

"not breaking userland", for one.

2020-08-11 21:36:35

by Casey Schaufler

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On 8/11/2020 1:28 PM, Miklos Szeredi wrote:
> On Tue, Aug 11, 2020 at 6:17 PM Casey Schaufler <[email protected]> wrote:
>
>> Since a////////b has known meaning, and lots of applications
>> play loose with '/', its really dangerous to treat the string as
>> special. We only get away with '.' and '..' because their behavior
>> was defined before many of y'all were born.
> So the founding fathers have set things in stone and now we can't
> change it. Right?

The founders did lots of things that, in retrospect, weren't
such great ideas, but that we have to live with.

> Well that's how it looks... but let's think a little; we have '/' and
> '\0' that can't be used in filenames. Also '.' and '..' are
> prohibited names. It's not a trivial limitation, so applications are
> probably not used to dumping binary data into file names.

Hee Hee. Back in the early days of UNIX (the 1970s) there was command
dsw(1) "delete from switches" because files with untypeible names where
unfortunately common. I would question the assertion that "applications
are not used to dumping binary data into file names", based on how
often I've wished we still had dsw(1).

> And that
> means it's probably possible to find a fairly short combination that
> is never used in practice (probably containing the "/." sequence).

You'd think, but you'd be wrong. In the UNIX days we tried everything
from "..." to ".NO_HID." and there always arose a problem or two. Not
the least of which is that a "magic" pathname generated on an old system,
then mounted on a new system will never give you the results you want.

> Why couldn't we reserve such a combination now?
>
> I have no idea how to find such it, but other than that, I see no
> theoretical problem with extending the list of reserved filenames.

You need a sequence that is never used in any language, and
that has never been used as a magic shell sequence. If you want
a fun story to tell over beers, look up how using the "@" as the
erase character on a TTY33 lead to it being used in email addresses.

> Thanks,
> Miklos

2020-08-12 00:08:31

by David Howells

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

Linus Torvalds <[email protected]> wrote:

> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]

Well, the start of it was my proposal of an fsinfo() system call. That at its
simplest takes an object reference (eg. a path) and an integer attribute ID (it
could use a string instead, I suppose, but it would mean a bunch of strcmps
instead of integer comparisons) and returns the value of the attribute. But I
allow you to do slightly more interesting things than that too.

Miklós seems dead-set against adding a system call specifically for this -
though he's proposed extending open in various ways and also proposed an
additional syscall, readfile(), that does the open+read+close all in one step.

I think also at some point, he (or maybe James?) proposed adding a new magic
filesystem mounted somewhere on proc (reflecting an open fd) that then had a
bunch of symlinks to somewhere in sysfs (reflecting a mount). The idea being
that you did something like:

fd = open("/path/to/object", O_PATH);
sprintf(name, "/proc/self/fds/%u/attr1", fd);
attrfd = open(name, O_RDONLY);
read(attrfd, buf1, sizeof(buf1));
close(attrfd);
sprintf(name, "/proc/self/fds/%u/attr2", fd);
attrfd = open(name, O_RDONLY);
read(attrfd, buf2, sizeof(buf2));
close(attrfd);

or:

sprintf(name, "/proc/self/fds/%u/attr1", fd);
readfile(name, buf1, sizeof(buf1));
sprintf(name, "/proc/self/fds/%u/attr2", fd);
readfile(name, buf2, sizeof(buf2));

and then "/proc/self/fds/12/attr2" might then be a symlink to, say,
"/sys/mounts/615/mount_attr".

Miklós's justification for this was that it could then be operated from a shell
script without the need for a utility - except that bash, at least, can't do
O_PATH opens.

James has proposed making fsconfig() able to retrieve attributes (though I'd
prefer to give it a sibling syscall that does the retrieval rather than making
fsconfig() do that too).

> {
> int fd, attrfd;
>
> fd = open(path, O_PATH);
> attrfd = openat(fd, name, O_ALT);
> close(fd);
> read(attrfd, value, size);
> close(attrfd);
> }

Please don't go down this path. You're proposing five syscalls - including
creating two file descriptors - to do what fsinfo() does in one.

Do you have a particular objection to adding a syscall specifically for
retrieving filesystem/VFS information?

-~-

Anyway, in case you're interested in what I want to get out of this - which is
the reason for it being posted in the first place:

(*) The ability to retrieve various attributes of a filesystem/superblock,
including information on:

- Filesystem features: Does it support things like hard links, user
quotas, direct I/O.

- Filesystem limits: What's the maximum size of a file, an xattr, a
directory; how many files can it support.

- Supported API features: What FS_IOC_GETFLAGS does it support? Which
can be set? Does it have Windows file attributes available? What
statx attributes are supported? What do the timestamps support?
What sort of case handling is done on filenames?

Note that for a lot of cases, this stuff is fixed and can just be memcpy'd
from rodata. Some of this is variable, however, in things like ext4 and
xfs, depending on, say, mkfs configuration. The situation is even more
complex with network filesystems as this may depend on the server they're
talking to.

But note also that some of this stuff might change file-to-file, even
within a superblock.

(*) The ability to retrieve attributes of a mount point, including information
on the flags, propagation settings and child lists.

(*) The ability to quickly retrieve a list of accessible mount point IDs,
with change event counters to permit userspace (eg. systemd) to quickly
determine if anything changed in the even of an overrun.

(*) The ability to find mounts/superblocks by mount ID. Paths are not unique
identifiers for mountpoints. You can stack multiple mounts on the same
directory, but a path only sees the top one.

(*) The ability to look inside a different mount namespace - one to which you
have a reference fd. This would allow a container manager to look inside
the container it is managing.

(*) The ability to expose filesystem-specific attributes. Network filesystems
can expose lists of servers and server addresses, for instance.

(*) The ability to use the object referenced to determine the namespace
(particularly the network namespace) to look in. The problem with looking
in, say, /proc/net/... is that it looks at current's net namespace -
whether or not the object of interest is in the same one.

(*) The ability to query the context attached to the fd obtained from
fsopen(). Such a context may not have a superblock attached to it yet or
may not be mounted yet.

The aim is to allow a container manager to supervise a mount being made in
a container. It kind of pairs with fsconfig() in that respect.

(*) The ability to query mount and superblock event counters to help a
watching process handle overrun in the notifications queue.

What I've done with fsinfo() is:

(*) Provided a number of ways to refer to the object to be queried (path,
dirfd+path, fd, mount ID - with others planned).

(*) Made it so that attibutes are referenced by a numeric ID to keep search
time minimal. Numeric IDs must be declared in uapi/linux/fsinfo.h.

(*) Made it so that the core does most of the work. Filesystems are given an
in-kernel buffer to copy into and don't get to see any userspace pointers.

(*) Made it so that values are not, by and large, encoded as text if it can be
avoided. Backward and forward compatibility on binary structs is handled
by the core. The filesystem just fills in the values in the UAPI struct
in the buffer. The core will zero-pad or truncate the data to match what
userspace asked for.

The UAPI struct must be declared in uapi/linux/fsinfo.h.

(*) Made it so that, for some attributes, the core will fill in the data as
best it can from what's available in the superblock, mount struct or mount
namespace. The filesystem can then amend this if it wants to.

(*) Made it so that attributes are typed. The types are few: string, struct,
list of struct, opaque. Structs are extensible: the length is the
version, a new version is required to be a superset of the old version and
excess requestage is simply cleared by the kernel.

Information about the type of an attribute can be queried by fsinfo().

What I want to avoid:

(*) Adding another magic filesystem.

(*) Adding symlinks from proc to sysfs.

(*) Having to use open to get an attribute.

(*) Having to use multiple opens to get an attribute.

(*) Having to pathwalk to get to the attribute from the object being queried.

(*) Allocating another O_ open flag for this.

(*) Avoidable text encoding and decoding.

(*) Letting the filesystem access the userspace buffer.

Note that I'm not against splitting fsinfo() into a set of sibling syscalls if
that makes it more palatable, or even against using strings for the attribute
IDs, though I'd prefer to avoid the strcmps.

David

2020-08-12 00:55:01

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, 2020-08-11 at 21:39 +0200, Christian Brauner wrote:
> On Tue, Aug 11, 2020 at 09:05:22AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 11, 2020 at 8:30 AM Miklos Szeredi <[email protected]>
> > wrote:
> > > What's the disadvantage of doing it with a single lookup WITH an
> > > enabling flag?
> > >
> > > It's definitely not going to break anything, so no backward
> > > compatibility issues whatsoever.
> >
> > No backwards compatibility issues for existing programs, no.
> >
> > But your suggestion is fundamentally ambiguous, and you most
> > definitely *can* hit that if people start using this in new
> > programs.
> >
> > Where does that "unified" pathname come from? It will be generated
> > from "base filename + metadata name" in user space, and
> >
> > (a) the base filename might have double or triple slashes in it
> > for
> > whatever reasons.
> >
> > This is not some "made-up gotcha" thing - I see double slashes
> > *all*
> > the time when we have things like Makefiles doing
> >
> > srctree=../../src/
> >
> > and then people do "$(srctree)/". If you haven't seen that kind of
> > pattern where the pathname has two (or sometimes more!) slashes in
> > the
> > middle, you've led a very sheltered life.
> >
> > (b) even if the new user space were to think about that, and
> > remove
> > those (hah! when have you ever seen user space do that?), as Al
> > mentioned, the user *filesystem* might have pathnames with double
> > slashes as part of symlinks.
> >
> > So now we'd have to make sure that when we traverse symlinks, that
> > O_ALT gets cleared. Which means that it's not a unified namespace
> > after all, because you can't make symlinks point to metadata.
> >
> > Or we'd retroactively change the semantics of a symlink, and that
> > _is_
> > a backwards compatibility issue. Not with old software, no, but it
> > changes the meaning of old symlinks!
> >
> > So no, I don't think a unified namespace ends up working.
> >
> > And I say that as somebody who actually loves the concept. Ask Al:
> > I
> > have a few times pushed for "let's allow directory behavior on
> > regular
> > files", so that you could do things like a tar-filesystem, and
> > access
> > the contents of a tar-file by just doing
> >
> > cat my-file.tar/inside/the/archive.c
> >
> > or similar.
> >
> > Al has convinced me it's a horrible idea (and there you have a
> > non-ambiguous marker: the slash at the end of a pathname that
> > otherwise looks and acts as a non-directory)
> >
>
> Putting my kernel hat down, putting my userspace hat on.
>
> I'm looking at this from a potential user of this interface.
> I'm not a huge fan of the metadata fd approach I'd much rather have a
> dedicated system call rather than opening a side-channel metadata fd
> that I can read binary data from. Maybe I'm alone in this but I was
> under the impression that other users including Ian, Lennart, and
> Karel
> have said on-list in some form that they would prefer this approach.
> There are even patches for systemd and libmount, I thought?

Not quite sure what you mean here.

Karel (with some contributions by me) has implemented the interfaces
for David's mount notifications and fsinfo() call in libmount. We
still have a little more to do on that.

I also have a systemd implementation that uses these libmount features
for mount table handling that works quite well, with a couple more
things to do to complete it, that Lennart has done an initial review
for.

It's no secret that I don't like the proc file system in general
but it is really useful for many things, that's just the way it
is.

Ian

2020-08-12 07:24:35

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Tue, Aug 11, 2020 at 11:19 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Aug 11, 2020 at 1:56 PM Miklos Szeredi <[email protected]> wrote:
> >
> > So that's where O_ALT comes in. If the application is consenting,
> > then that should prevent exploits. Or?
>
> If the application is consenting AND GETS IT RIGHT it should prevent exploits.
>
> But that's a big deal.
>
> Why not just do it the way I suggested? Then you don't have any of these issues.

Will do.

I just want to understand the reasons why a unified namespace is
completely out of the question. And I won't accept "it's just fugly"
or "it's the way it's always been done, so don't change it". Those
are not good reasons.

Oh, I'm used to these "fights", had them all along. In hindsight I
should have accepted others' advice in some of the cases, but in
others that big argument turned out to be a complete non-issue. One
such being inode and dentry duplication in the overlayfs case vs.
in-built stacking in the union-mount case. There were a lot of issues
with overlayfs, that's true, but dcache/icache size has NEVER actually
been reported as a problem.

While Al has a lot of experience, it's hard to accept all that
anecdotal evidence just because he says it. Your worries are also
just those: worries. They may turn out to be an issue or they may
not.

Anyway, starting with just introducing the alt namespace without
unification seems to be a good first step. If that turns out to be
workable, we can revisit unification later.

Thanks,
Miklos

2020-08-12 07:57:23

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

On Wed, Aug 12, 2020 at 2:05 AM David Howells <[email protected]> wrote:

> > {
> > int fd, attrfd;
> >
> > fd = open(path, O_PATH);
> > attrfd = openat(fd, name, O_ALT);
> > close(fd);
> > read(attrfd, value, size);
> > close(attrfd);
> > }
>
> Please don't go down this path. You're proposing five syscalls - including
> creating two file descriptors - to do what fsinfo() does in one.

So what? People argued against readfile() for exactly the opposite of
reasons, even though that's a lot less specialized than fsinfo().

Worried about performance? Io-uring will allow you to do all those
five syscalls (or many more) with just one I/O submission.

Thanks,
Miklos

2020-08-12 08:30:09

by David Howells

[permalink] [raw]

Subject: Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

Miklos Szeredi <[email protected]> wrote:

> Worried about performance? Io-uring will allow you to do all those
> five syscalls (or many more) with just one I/O submission.

io_uring isn't going to help here. We're talking about synchronous reads.
AIUI, you're adding a couple more syscalls to the list and running stuff in a
side thread to save the effort of going in and out of the kernel five times.
But you still have to pay the set up/tear down costs on the fds and do the
pathwalks. io_uring doesn't magically make that cost disappear.

io_uring also requires resources such as a kernel accessible ring buffer to
make it work.

You're proposing making everything else more messy just to avoid a dedicated
syscall. Could you please set out your reasoning for that?

David

2020-08-12 08:38:21