2020-03-30 14:07:06

by David Howells

[permalink] [raw]
Subject: Upcoming: Notifications, FS notifications and fsinfo()


Hi Linus,

I have three sets of patches I'd like to push your way, if you (and Al) are
willing to consider them.

(1) General notification queue plus key/keyring notifications.

This adds the core of the notification queue built on pipes, and adds
the ability to watch for changes to keys.

(2) Mount and superblock notifications.

This builds on (1) to provide notifications of mount topology changes
and implements a framework for superblock events (configuration
changes, I/O errors, quota/space overruns and network status changes).

(3) Filesystem information retrieval.

This provides an extensible way to retrieve informational attributes
about mount objects and filesystems. This includes providing
information intended to make recovering from a notification queue
overrun much easier.

We need (1) for Gnome to efficiently watch for changes in Kerberos
keyrings. Debarshi Ray has patches ready to go for gnome-online-accounts
so that it can make use of the facility.
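
As a rough illustration (this isn't final ABI, so the exact flag and
keyctl names below may differ slightly from the patches), watching a
keyring for changes would look something like this; link with -lkeyutils:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <keyutils.h>
#include <linux/watch_queue.h>	/* O_NOTIFICATION_PIPE, IOC_WATCH_QUEUE_SET_SIZE */

int main(void)
{
	int pipefd[2];
	char buf[4096];
	ssize_t n;

	/* The notification queue is an ordinary pipe opened in
	 * notification mode, sized here to 256 slots. */
	if (pipe2(pipefd, O_NOTIFICATION_PIPE) == -1)
		return perror("pipe2"), 1;
	if (ioctl(pipefd[0], IOC_WATCH_QUEUE_SET_SIZE, 256) == -1)
		return perror("IOC_WATCH_QUEUE_SET_SIZE"), 1;

	/* Ask to be told about changes to the session keyring; 0x01 is an
	 * arbitrary tag that comes back in each notification record. */
	if (keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, pipefd[0], 0x01) == -1)
		return perror("KEYCTL_WATCH_KEY"), 1;

	/* Each read returns packed struct watch_notification records. */
	while ((n = read(pipefd[0], buf, sizeof(buf))) > 0)
		printf("got %zd bytes of key notifications\n", n);
	return 0;
}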

Sets (2) and (3) can make libmount more efficient. Karel Zak is working on
making use of this to avoid reading /proc/mountinfo.

We need something to make systemd's watching of the mount topology more
efficient, and (2) and (3) can help with this by making it faster to narrow
down what changed. I think Karel has this in his sights, but hasn't yet
managed to work on it.

Set (2) should be able to make it easier to watch for mount options inside
a container, and set (3) should make it easier to examine the mounts inside
another mount namespace inside a container in a way that can't be done with
/proc/mounts. This is requested by Christian Brauner.

Jeff Layton has a tentative addition to (3) to expose error state to
userspace, and Andres Freund would like this for Postgres.

Set (3) further allows the information returned by calls such as statx()
and ioctl(FS_IOC_GETFLAGS) to be qualified by indicating which bits
are/aren't actually supported.
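
For comparison, statx() already carries a limited form of this in its
stx_mask and stx_attributes_mask fields, and fsinfo() would extend the
same idea. A minimal sketch of that existing mechanism (not part of
these patches):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(void)
{
	struct statx stx;

	if (statx(AT_FDCWD, "/etc/passwd", 0,
		  STATX_BASIC_STATS | STATX_BTIME, &stx) == -1)
		return 1;

	/* stx_mask says which of the requested fields the filesystem
	 * actually filled in... */
	if (stx.stx_mask & STATX_BTIME)
		printf("birth time is available here\n");

	/* ...and stx_attributes_mask says which STATX_ATTR_* bits are
	 * meaningful at all on this filesystem. */
	if (stx.stx_attributes_mask & STATX_ATTR_IMMUTABLE)
		printf("the immutable flag is reportable here\n");
	return 0;
}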

Further, for (3), I also allow filesystem-specific overrides/extensions to
fsinfo() and have a use for it in AFS to expose information about server
preference for a particular volume (something that is necessary for
implementing the toolset). I've provided example code that does something
similar for NFS and some that exposes superblock info from Ext4. At Vault,
Steve expressed an interest in this for CIFS and Ted Ts'o expressed a
possible interest for Ext4.

Notes:

(*) These patches will conflict with the apparently upcoming refactoring of
the security core, but the fixup doesn't look too bad:

https://lore.kernel.org/linux-next/[email protected]/T/#u

(*) Miklós Szeredi would much prefer to implement fsinfo() as a magic
filesystem mounted on /proc/self/fsinfo/ whereby your open fds appear
as directories under there, each with a set of attribute files
corresponding to the attributes that fsinfo() would otherwise provide.
To examine something by filename, you'd have to open it O_PATH and
then read the individual attribute files in the corresponding per-fd
directory. A readfile() system call has been mooted to elide the
{open,read,close} sequence to make it more efficient.

(*) James Bottomley would like to deprecate fsopen(), fspick(), fsconfig()
and fsmount() in favour of a more generic configfs with dedicated
open, set-config and action syscalls, with an additional get-config
syscall that would be used instead of fsinfo() - though, as I
understand it, you'd have to create a config (fspick-equivalent)
before you could use get-config.

(*) I don't think Al has particularly looked at fsinfo() or the fs
notifications patches yet.

(*) I'm not sure what *your* opinion of fsinfo() is yet. If you don't
dislike it too, um, fragrantly, would you be willing to entertain part
of it for now and prefer the rest to stew a bit longer? I can drop
some of the pieces.

Anyway, I'm going to formulate a pull request for each of them.

Thanks,
David


2020-03-30 20:30:53

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mon, Mar 30, 2020 at 3:58 PM David Howells <[email protected]> wrote:
>
>
> Hi Linus,
>
> I have three sets of patches I'd like to push your way, if you (and Al) are
> willing to consider them.

The basic problem, in my view, is that the performance requirement of a
"get filesystem information" type of API just does not warrant a
binary-coded interface. I've said this a number of times, but it fell
on deaf ears.

Such binary ABIs (especially if not very carefully designed and
reviewed) usually go through several revisions as the structures fail
to account for future changes in the representation of their fields.
There are too many examples of this to count. Then there's the problem
of needing to update libc, utilities and language bindings on each
revision or extension of the interface.

All this could be solved with a string key/value representation of the
same data, with minimal performance loss on encoding/parsing. The
proposed fs interface[1] is one example of that, but I could also
imagine a syscall-based one.

Thanks,
Miklos

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/

2020-03-30 21:17:46

by Christian Brauner

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

[Cc Lennart and Aleksa, both of whom maintain projects that would also
make use of this]

On Mon, Mar 30, 2020 at 02:58:22PM +0100, David Howells wrote:
>
> Hi Linus,
>
> I have three sets of patches I'd like to push your way, if you (and Al) are
> willing to consider them.
>
> (1) General notification queue plus key/keyring notifications.
>
> This adds the core of the notification queue built on pipes, and adds
> the ability to watch for changes to keys.
>
> (2) Mount and superblock notifications.
>
> This builds on (1) to provide notifications of mount topology changes
> and implements a framework for superblock events (configuration
> changes, I/O errors, quota/space overruns and network status changes).
>
> (3) Filesystem information retrieval.
>
> This provides an extensible way to retrieve informational attributes
> about mount objects and filesystems. This includes providing
> information intended to make recovering from a notification queue
> overrun much easier.
>
> We need (1) for Gnome to efficiently watch for changes in kerberos
> keyrings. Debarshi Ray has patches ready to go for gnome-online-accounts
> so that it can make use of the facility.
>
> Sets (2) and (3) can make libmount more efficient. Karel Zak is working on
> making use of this to avoid reading /proc/mountinfo.
>
> We need something to make systemd's watching of the mount topology more
> efficient, and (2) and (3) can help with this by making it faster to narrow
> down what changed. I think Karel has this in his sights, but hasn't yet
> managed to work on it.
>
> Set (2) should be able to make it easier to watch for mount options inside
> a container, and set (3) should make it easier to examine the mounts inside
> another mount namespace inside a container in a way that can't be done with
> /proc/mounts. This is requested by Christian Brauner.
>
> Jeff Layton has a tentative addition to (3) to expose error state to
> userspace, and Andres Freund would like this for Postgres.
>
> Set (3) further allows the information returned by such as statx() and
> ioctl(FS_IOC_GETFLAGS) to be qualified by indicating which bits are/aren't
> supported.
>
> Further, for (3), I also allow filesystem-specific overrides/extensions to
> fsinfo() and have a use for it to AFS to expose information about server
> preference for a particular volume (something that is necessary for
> implementing the toolset). I've provided example code that does similar
> for NFS and some that exposes superblock info from Ext4. At Vault, Steve
> expressed an interest in this for CIFS and Ted Ts'o expressed a possible
> interest for Ext4.
>
> Notes:
>
> (*) These patches will conflict with apparently upcoming refactoring of
> the security core, but the fixup doesn't look too bad:
>
> https://lore.kernel.org/linux-next/[email protected]/T/#u
>
> (*) Miklós Szeredi would much prefer to implement fsinfo() as a magic
> filesystem mounted on /proc/self/fsinfo/ whereby your open fds appear
> as directories under there, each with a set of attribute files
> corresponding to the attributes that fsinfo() would otherwise provide.
> To examine something by filename, you'd have to open it O_PATH and
> then read the individual attribute files in the corresponding per-fd
> directory. A readfile() system call has been mooted to elide the
> {open,read,close} sequence to make it more efficient.

Fwiw, putting down my kernel hat and speaking as someone who maintains
two container runtimes and various other low-level bits and pieces in
userspace that would make heavy use of this stuff, I would prefer the
fd-based fsinfo() approach, especially in light of cross-namespace
operations, querying all properties of a mount atomically all-at-once,
and safe delegation through fds. Another heavy user of this would be
systemd (Cc'd Lennart, who I've discussed this with), which would prefer
the fd-based approach as well. I think pulling this into a filesystem
and making userspace parse around in a filesystem tree to query mount
information is the wrong approach and will get messy pretty quickly,
especially in the face of mount and user namespace interactions and
various other pitfalls. fsinfo() fits quite nicely with the all-fd-based
approach of the whole mount API. So yes, definitely preferred from my
end.

Christian

2020-03-31 05:12:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
<[email protected]> wrote:

> Fwiw, putting down my kernel hat and speaking as someone who maintains
> two container runtimes and various other low-level bits and pieces in
> userspace who'd make heavy use of this stuff I would prefer the fd-based
> fsinfo() approach especially in the light of across namespace
> operations, querying all properties of a mount atomically all-at-once,

fsinfo(2) doesn't meet the atomically all-at-once requirement. Sure,
it's possible to check the various change counters before and after a
batch of calls to verify that the result is consistent. Still, that's
not an atomic all-at-once query; if you really require that, then
fsinfo(2) as it currently stands would be inadequate.

> and safe delegation through fds. Another heavy user of this would be
> systemd (Cced Lennart who I've discussed this with) which would prefer
> the fd-based approach as well. I think pulling this into a filesystem
> and making userspace parse around in a filesystem tree to query mount
> information is the wrong approach and will get messy pretty quickly
> especially in the face of mount and user namespace interactions and
> various other pitfalls.

Have you actually looked at my proposed patch? Do you have concrete
issues or just vague bad feelings?

Thanks,
Miklos

2020-03-31 07:30:24

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mo, 30.03.20 23:17, Christian Brauner ([email protected]) wrote:

> Fwiw, putting down my kernel hat and speaking as someone who maintains
> two container runtimes and various other low-level bits and pieces in
> userspace who'd make heavy use of this stuff I would prefer the fd-based
> fsinfo() approach especially in the light of across namespace
> operations, querying all properties of a mount atomically all-at-once,
> and safe delegation through fds. Another heavy user of this would be
> systemd (Cced Lennart who I've discussed this with) which would prefer
> the fd-based approach as well. I think pulling this into a filesystem
> and making userspace parse around in a filesystem tree to query mount
> information is the wrong approach and will get messy pretty quickly
> especially in the face of mount and user namespace interactions and
> various other pitfalls. fsinfo() fits quite nicely with the all-fd-based
> approach of the whole mount api. So yes, definitely preferred from my
> end.

Christian is right. I think it's very important to have an API that
allows querying fs attributes in a consistent state, i.e. so that the
attributes userspace is interested in can be queried in a single call
and all describe the very same point in time. Distributing attributes
onto multiple individual files just sucks, because it's then guaranteed
that you can never read them in a way that actually fits together: some
attributes you read will be older, others newer. It's a big design flaw
of sysfs (which is structured like this) if you ask me.

I don't really care if the kernel API for this is binary or
textual. Slight preference for binary, but I don't care too much.

I think it would be wise to bind such APIs to fds, simply because that
always works. Doing path-based stuff sucks, because you always need to
mount stuff and have a path tree set up, which is less ideal in a world
where namespacing is common, and namespaces are a shared concept (at
least with your other threads, if not with other processes). As a
maintainer of an init system I really dislike APIs that I can only use
after a mount structure has been set up; too often we want to do stuff
before that. Moreover, philosophically I find it questionable to use
path-based APIs to interface with the path object hierarchy itself. It
feels "too recursive". Just keep this separate: build stuff on top of
the fs that fits on top of the fs, but don't build fs APIs on top of
fs APIs that stem from the same layer.

Summary: atomic APIs rock, fd-based APIs rock. APIs built on
individual files one can only read individually suck. APIs of the path
layer exposed in the path layer suck.

Hope this makes some sense?

Lennart

2020-03-31 08:16:01

by Christian Brauner

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> <[email protected]> wrote:
>
> > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > two container runtimes and various other low-level bits and pieces in
> > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > fsinfo() approach especially in the light of across namespace
> > operations, querying all properties of a mount atomically all-at-once,
>
> fsinfo(2) doesn't meet the atomically all-at-once requirement. Sure,
> it's possible to check the various change counters before and after a
> batch of calls to check that the result is consistent. Still, that's
> not an atomic all-at-once query, if you'd really require that, than
> fsinfo(2) as it currently stands would be inadequate.

If at all, that's only true for batch requests.

>
> > and safe delegation through fds. Another heavy user of this would be
> > systemd (Cced Lennart who I've discussed this with) which would prefer
> > the fd-based approach as well. I think pulling this into a filesystem
> > and making userspace parse around in a filesystem tree to query mount
> > information is the wrong approach and will get messy pretty quickly
> > especially in the face of mount and user namespace interactions and
> > various other pitfalls.
>
> Have you actually looked at my proposed patch? Do you have concrete

Yes. So have others; Al actively disliked and nacked it, and no one got
excited about it.

> issues or just vague bad feelings?

We have had that discussion on-list, where I made my "vague bad feelings"
clear and you responded with the same dismissive style, so I don't see
the point in repeating the experience.

Again, I want to make it clear that here I'm stating my preference as a
user of this API, and as such I don't want to have to parse through a
filesystem to get complex information about filesystems. We've had
fruitful discussions [1] around how fsinfo() ties in with supervised
mounts and the rest of the mount API; it's clear and simple, especially
in the face of namespaces, and implements a nice delegation model. So +1
from me.

Christian

[1]: https://youtu.be/LN2CUgp8deo?t=6840

2020-03-31 08:35:40

by Karel Zak

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> <[email protected]> wrote:
>
> > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > two container runtimes and various other low-level bits and pieces in
> > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > fsinfo() approach especially in the light of across namespace
> > operations, querying all properties of a mount atomically all-at-once,
>
> fsinfo(2) doesn't meet the atomically all-at-once requirement.

I guess your /proc-based idea has exactly the same problem...

I see two possible ways:

- after open("/mnt", O_PATH) create copy-on-write object in kernel to
represent mount node -- kernel will able to modify it, but userspace
will get unchanged data from the FD until to close()

- improve fsinfo() to provide set (list) of the attributes by one call

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2020-03-31 08:37:01

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 10:15 AM Christian Brauner
<[email protected]> wrote:
>
> On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > <[email protected]> wrote:
> >
> > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > two container runtimes and various other low-level bits and pieces in
> > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > fsinfo() approach especially in the light of across namespace
> > > operations, querying all properties of a mount atomically all-at-once,
> >
> > fsinfo(2) doesn't meet the atomically all-at-once requirement. Sure,
> > it's possible to check the various change counters before and after a
> > batch of calls to check that the result is consistent. Still, that's
> > not an atomic all-at-once query, if you'd really require that, than
> > fsinfo(2) as it currently stands would be inadequate.
>
> If at all, that's only true for batch requests.

For example, there's no way to atomically query mount flags, parent,
and list of children with a single fsinfo() call; you actually need
three calls, and they can reflect different states of the same mount.
I'm not saying this is a problem, just that there's no list of
requirements on what is needed and why.

> > > and safe delegation through fds. Another heavy user of this would be
> > > systemd (Cced Lennart who I've discussed this with) which would prefer
> > > the fd-based approach as well. I think pulling this into a filesystem
> > > and making userspace parse around in a filesystem tree to query mount
> > > information is the wrong approach and will get messy pretty quickly
> > > especially in the face of mount and user namespace interactions and
> > > various other pitfalls.
> >
> > Have you actually looked at my proposed patch? Do you have concrete
>
> Yes. So have others, Al actively disliked and nacked it and no-one got
> excited about it.

Al, as far as I remember, nacked several things the patch was doing.
I fixed those.

> > issues or just vague bad feelings?
>
> We have had that discussion on-list where I made my "vague bad feelings"
> clear where you responded with the same dismissive style so I don't see
> the point in repeating this experience.
>
> Again, I want to make it clear that here I'm stating my preference as a
> user of this api and as such I don't want to have to parse through a
> filesystem to get complex information about filesystems. We've had
> fruitful discussions [1] around how fsinfo() ties in with supervised
> mounts and the rest of the mount api and its clear and simple especially
> in the face of namespaces and implements a nice delegation model. So +1
> from me.

And you keep ignoring the fact that my patch implements that exact
same delegation model. That's why I'm asking if you have looked at it
or not.

The "I don't want to have to parse through a filesystem to get complex
information about filesystems" is not a set of requirements that an
API can be designed from.

Thanks,
Miklos

2020-03-31 08:57:36

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <[email protected]> wrote:
>
> On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > <[email protected]> wrote:
> >
> > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > two container runtimes and various other low-level bits and pieces in
> > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > fsinfo() approach especially in the light of across namespace
> > > operations, querying all properties of a mount atomically all-at-once,
> >
> > fsinfo(2) doesn't meet the atomically all-at-once requirement.
>
> I guess your /proc based idea have exactly the same problem...

Yes, that's exactly what I wanted to demonstrate: there's no
fundamental difference between the two APIs in this respect.

> I see two possible ways:
>
> - after open("/mnt", O_PATH) create copy-on-write object in kernel to
> represent mount node -- kernel will able to modify it, but userspace
> will get unchanged data from the FD until to close()
>
> - improve fsinfo() to provide set (list) of the attributes by one call

I think we are approaching this from the wrong end. Let's just
ignore all of the proposed interfaces for now and only concentrate on
what this will be used for.

Start with a set of use cases by all interested parties. E.g.

- systemd wants to keep track of attached mounts in a namespace, as well
as new detached mounts created by fsmount()

- systemd needs to keep information (such as parent, children, mount
flags, fs options, etc.) up to date on any change of topology or
attributes.

- util-linux needs to display the topology and state of mounts in the
system, corresponding to a consistent state of that set of mounts

- etc...

From that we can derive a set of requirements for the API.

Thanks,
Miklos

2020-03-31 09:22:20

by Karel Zak

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mon, Mar 30, 2020 at 10:28:56PM +0200, Miklos Szeredi wrote:
> All this could be solved with a string key/value representation of the
> same data, with minimal performance loss on encoding/parsing. The
> proposed fs interface[1] is one example of that, but I could also
> imagine a syscall based one too.

Yes, key/value is a possible solution. The question is whether we really
need to add an extra /sys-like filesystem to get key/value ;-) I can
imagine getting key/value pairs from an FD-based interface without an
open/read/close for each attribute:

fd = open("/mnt", O_PATH);
fsinfo(fd, "propagation", buf, sizeof(buf));
fsinfo(fd, "fstype", buf, sizeof(buf));
close(fd);

Why do I need /mountfs/<id>/propagation and /mountfs/<id>/fstype to get
the same? It sounds like over-engineering without any extra bonus.

Anyway, if we have FD-based interfaces like fsopen(), fsmount(),
open_tree() and move_mount(), then it sounds strange that you cannot
use the FD to ask the kernel for the mount node attributes and instead
need to open and read other /sys-like files.
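
For context, the existing fd-based flow looks roughly like the sketch
below (raw syscall() wrappers, since glibc has no stubs yet; error
handling omitted); fsinfo() on the same fds would be the missing query
half of the family:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>	/* FSCONFIG_*, MOVE_MOUNT_* (kernel >= 5.2 headers) */

/* Thin wrappers: glibc has no stubs for the new mount API syscalls yet. */
static int sys_fsopen(const char *fs, unsigned int flags)
{ return syscall(__NR_fsopen, fs, flags); }
static int sys_fsconfig(int fd, unsigned int cmd, const char *key,
			const void *val, int aux)
{ return syscall(__NR_fsconfig, fd, cmd, key, val, aux); }
static int sys_fsmount(int fd, unsigned int flags, unsigned int attrs)
{ return syscall(__NR_fsmount, fd, flags, attrs); }
static int sys_move_mount(int ffd, const char *fpath, int tfd,
			  const char *tpath, unsigned int flags)
{ return syscall(__NR_move_mount, ffd, fpath, tfd, tpath, flags); }

int main(void)
{
	/* Configure and create a tmpfs entirely through fds... */
	int fsfd = sys_fsopen("tmpfs", 0);
	sys_fsconfig(fsfd, FSCONFIG_SET_STRING, "size", "1M", 0);
	sys_fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);

	/* ...turn it into a detached mount fd and attach it at /mnt. */
	int mntfd = sys_fsmount(fsfd, 0, 0);
	sys_move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

	/* The point above: a query call on mntfd (fsinfo() or similar)
	 * would complete this family without any path walk. */
	return 0;
}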

IMHO it would be nice if, after open("/mnt", O_PATH), I could do whatever
I want with the mount point (umount, move, reconfigure, query, etc.).
Please, try to keep it simple and consistent ;-)

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2020-03-31 09:50:21

by Karel Zak

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 10:56:35AM +0200, Miklos Szeredi wrote:
> I think we are approaching this from the wrong end. Let's just
> ignore all of the proposed interfaces for now and only concentrate on
> what this will be used for.
>
> Start with a set of use cases by all interested parties. E.g.
>
> - systemd wants to keep track attached mounts in a namespace, as well
> as new detached mounts created by fsmount()
>
> - systemd need to keep information (such as parent, children, mount
> flags, fs options, etc) up to date on any change of topology or
> attributes.
>
> - util linux needs to display the topology and state of mounts in the
> system that corresponds to a consistent state that set of mounts

- like systemd, we also need mount/umount to be able to query one
mountpoint rather than parse all of /proc/self/mountinfo

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2020-03-31 12:26:45

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Di, 31.03.20 10:56, Miklos Szeredi ([email protected]) wrote:

> On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <[email protected]> wrote:
> >
> > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > > <[email protected]> wrote:
> > >
> > > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > > two container runtimes and various other low-level bits and pieces in
> > > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > > fsinfo() approach especially in the light of across namespace
> > > > operations, querying all properties of a mount atomically all-at-once,
> > >
> > > fsinfo(2) doesn't meet the atomically all-at-once requirement.
> >
> > I guess your /proc based idea have exactly the same problem...
>
> Yes, that's exactly what I wanted to demonstrate: there's no
> fundamental difference between the two API's in this respect.
>
> > I see two possible ways:
> >
> > - after open("/mnt", O_PATH) create copy-on-write object in kernel to
> > represent mount node -- kernel will able to modify it, but userspace
> > will get unchanged data from the FD until to close()
> >
> > - improve fsinfo() to provide set (list) of the attributes by one call
>
> I think we are approaching this from the wrong end. Let's just
> ignore all of the proposed interfaces for now and only concentrate on
> what this will be used for.
>
> Start with a set of use cases by all interested parties. E.g.
>
> - systemd wants to keep track attached mounts in a namespace, as well
> as new detached mounts created by fsmount()
>
> - systemd need to keep information (such as parent, children, mount
> flags, fs options, etc) up to date on any change of topology or
> attributes.

- We also have code that recursively remounts r/o or unmounts some
directory tree (with filters), which is currently nasty to do since
the relationships between dirs are not always clear from
/proc/self/mountinfo alone, in particular not in an even remotely
atomic fashion, or when stuff is overmounted.

- We also have code that needs to check if /dev/ is plain tmpfs or
devtmpfs. We cannot use statfs() for that, since TMPFS_MAGIC is reported
in both cases; hence we currently parse /proc/self/mountinfo to find the
fstype string there, which does differ between the two.
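
Roughly, the check we have to do today looks like the sketch below (the
mountinfo parsing is simplified; the fs type is the first field after
the " - " separator, see proc(5)):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>
#include <linux/magic.h>

int main(void)
{
	struct statfs sfs;
	char line[4096];
	FILE *f;

	/* statfs() can't help: tmpfs and devtmpfs both report TMPFS_MAGIC. */
	if (statfs("/dev", &sfs) == 0 && sfs.f_type == TMPFS_MAGIC)
		printf("/dev is tmpfs-like, but which one?\n");

	/* So fall back to scanning /proc/self/mountinfo: the fs type is
	 * the first field after the " - " separator. */
	f = fopen("/proc/self/mountinfo", "re");
	while (f && fgets(line, sizeof(line), f)) {
		char *opt = strstr(line, " - ");
		if (!opt || !strstr(line, " /dev "))
			continue;
		printf("fstype of /dev: %.*s\n",
		       (int)strcspn(opt + 3, " "), opt + 3);
		break;
	}
	if (f)
		fclose(f);
	return 0;
}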

Lennart

--
Lennart Poettering, Berlin

2020-03-31 15:12:24

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 2:25 PM Lennart Poettering <[email protected]> wrote:
>
> On Di, 31.03.20 10:56, Miklos Szeredi ([email protected]) wrote:
>
> > On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <[email protected]> wrote:
> > >
> > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > > > <[email protected]> wrote:
> > > >
> > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > > > two container runtimes and various other low-level bits and pieces in
> > > > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > > > fsinfo() approach especially in the light of across namespace
> > > > > operations, querying all properties of a mount atomically all-at-once,
> > > >
> > > > fsinfo(2) doesn't meet the atomically all-at-once requirement.
> > >
> > > I guess your /proc based idea have exactly the same problem...
> >
> > Yes, that's exactly what I wanted to demonstrate: there's no
> > fundamental difference between the two API's in this respect.
> >
> > > I see two possible ways:
> > >
> > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to
> > > represent mount node -- kernel will able to modify it, but userspace
> > > will get unchanged data from the FD until to close()
> > >
> > > - improve fsinfo() to provide set (list) of the attributes by one call
> >
> > I think we are approaching this from the wrong end. Let's just
> > ignore all of the proposed interfaces for now and only concentrate on
> > what this will be used for.
> >
> > Start with a set of use cases by all interested parties. E.g.
> >
> > - systemd wants to keep track attached mounts in a namespace, as well
> > as new detached mounts created by fsmount()
> >
> > - systemd need to keep information (such as parent, children, mount
> > flags, fs options, etc) up to date on any change of topology or
> > attributes.
>
> - We also have code that recursively remounts r/o or unmounts some
> directory tree (with filters),

Recursive remount-ro is clear. What is not clear is whether you need
to do this for hidden mounts (not possible from userspace without a
way to disable mount following on path lookup). Would it make sense
to add a kernel API for recursive setting of mount flags?

What exactly is this unmount with filters? Can you give examples?

> - We also have code that needs to check if /dev/ is plain tmpfs or
> devtmpfs. We cannot use statfs for that, since in both cases
> TMPFS_MAGIC is reported, hence we currently parse
> /proc/self/mountinfo for that to find the fstype string there, which
> is different for both cases.

Okay.

Thanks,
Miklos

2020-03-31 15:26:32

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Di, 31.03.20 17:10, Miklos Szeredi ([email protected]) wrote:

> On Tue, Mar 31, 2020 at 2:25 PM Lennart Poettering <[email protected]> wrote:
> >
> > On Di, 31.03.20 10:56, Miklos Szeredi ([email protected]) wrote:
> >
> > > On Tue, Mar 31, 2020 at 10:34 AM Karel Zak <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 31, 2020 at 07:11:11AM +0200, Miklos Szeredi wrote:
> > > > > On Mon, Mar 30, 2020 at 11:17 PM Christian Brauner
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Fwiw, putting down my kernel hat and speaking as someone who maintains
> > > > > > two container runtimes and various other low-level bits and pieces in
> > > > > > userspace who'd make heavy use of this stuff I would prefer the fd-based
> > > > > > fsinfo() approach especially in the light of across namespace
> > > > > > operations, querying all properties of a mount atomically all-at-once,
> > > > >
> > > > > fsinfo(2) doesn't meet the atomically all-at-once requirement.
> > > >
> > > > I guess your /proc based idea have exactly the same problem...
> > >
> > > Yes, that's exactly what I wanted to demonstrate: there's no
> > > fundamental difference between the two API's in this respect.
> > >
> > > > I see two possible ways:
> > > >
> > > > - after open("/mnt", O_PATH) create copy-on-write object in kernel to
> > > > represent mount node -- kernel will able to modify it, but userspace
> > > > will get unchanged data from the FD until to close()
> > > >
> > > > - improve fsinfo() to provide set (list) of the attributes by one call
> > >
> > > I think we are approaching this from the wrong end. Let's just
> > > ignore all of the proposed interfaces for now and only concentrate on
> > > what this will be used for.
> > >
> > > Start with a set of use cases by all interested parties. E.g.
> > >
> > > - systemd wants to keep track attached mounts in a namespace, as well
> > > as new detached mounts created by fsmount()
> > >
> > > - systemd need to keep information (such as parent, children, mount
> > > flags, fs options, etc) up to date on any change of topology or
> > > attributes.
> >
> > - We also have code that recursively remounts r/o or unmounts some
> > directory tree (with filters),
>
> Recursive remount-ro is clear. What is not clear is whether you need
> to do this for hidden mounts (not possible from userspace without a
> way to disable mount following on path lookup). Would it make sense
> to add a kernel API for recursive setting of mount flags?

I would be very happy about an explicit kernel API for recursively
toggling MS_RDONLY. But for many use cases in systemd we need the
ability to filter out some subdirs and leave them as they are, so while
helpful, we'd have to keep the userspace code we currently have anyway.

> What exactly is this unmount with filters? Can you give examples?

Hmm, actually it's only the r/o remount that has filters, not the
unmount. Sorry for the confusion. And the r/o remount with filters
just means: "remount everything below X read-only except for X/Y and
X/Z/A"...

Lennart

--
Lennart Poettering, Berlin

2020-03-31 17:32:12

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> The basic problem in my view, is that the performance requirement of a
> "get filesystem information" type of API just does not warrant a
> binary coded interface. I've said this a number of times, but it fell
> on deaf ears.

It hasn't so fallen, but I don't necessarily agree with you. Let's pin
some numbers on this.

I'm using what I think is your latest patch, where you look up
/proc/<pid>/fdinfo/<fd> to find a file that gives a summary of some
information in "key: val" format, including a mount ID. You then have to
look in a mounted mountfs magic filesystem for a directory corresponding
to that mount ID, which has a bunch of attribute files in it, most with a
single attribute value.

What I can do with it is, say, look up the mount ID of the object attached to
a path - but that's about all because it doesn't implement anything like
look-up-by-mount-ID or list-children.

Attached is a kernel patch, supplementary to the fsinfo patchset, that adds
your implementation, fixed for coexistence with the mount notifications code,
plus a sample program that creates N mounts and then sees how long it takes to
query each of those mounts for its mnt_id by four different methods:

(1) "f" - Use fsinfo, looking up each mount root directly by path.

(2) "f2" - Use fsinfo, firstly using fsinfo() to look up the base mount by
path, then use fsinfo() to get a list of all the children of that mount
(which in fact gives me the mnt_id, but ignoring that), and then call
fsinfo() by mount ID for each child to get its information, including its
mnt_id.

(3) "p" - Open the root of each mount with O_PATH and then open and read the
procfile to retrieve information, then parse the received text to find
the line with that key, then parse the line to get the number, allowing
for the possibility that the line might acquire extra numbers.

(4) "p2" - Open the root of the base mount with O_PATH, then read the
appropriate file in /proc/fdinfo to find the base mount ID. Open "/mnt"
O_PATH to use as a base. Then read <mntid>/children and parse the list
to find each child. Each child's <mntid>/id file is then read.
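
For reference, the core of the "p" method is just this (simplified; the
real test program also copes with extra numbers appearing on the line):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Return the mount ID of whatever 'path' resides on, or -1 on error. */
static int mnt_id_of(const char *path)
{
	char proc[64], line[256];
	int id = -1, fd = open(path, O_PATH | O_CLOEXEC);
	FILE *f;

	if (fd == -1)
		return -1;
	snprintf(proc, sizeof(proc), "/proc/self/fdinfo/%d", fd);
	f = fopen(proc, "re");
	while (f && fgets(line, sizeof(line), f)) {
		/* fdinfo lines look like "mnt_id:\t21"; only the first
		 * number on the line is taken. */
		if (sscanf(line, "mnt_id: %d", &id) == 1)
			break;
	}
	if (f)
		fclose(f);
	close(fd);
	return id;
}

int main(int argc, char *argv[])
{
	printf("%d\n", mnt_id_of(argc > 1 ? argv[1] : "/"));
	return 0;
}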

Run the program like:

mount -t mountfs none /mnt
mkdir /tmp/a
./test-fsinfo-perf /tmp/a 20000

Note that it detaches its base mount afterwards and lets it get cleaned
up, which makes systemd go crazy for a bit. Note also that it prints the
sum of all the mount IDs as a consistency check for each test.

Okay, the results:

For 1000 mounts, f= 1514us f2= 1102us p= 6014us p2= 6935us; p=4.0*f p=5.5*f2 p=0.9*p2
For 2000 mounts, f= 4712us f2= 3675us p= 20937us p2= 22878us; p=4.4*f p=5.7*f2 p=0.9*p2
For 3000 mounts, f= 6795us f2= 5304us p= 31080us p2= 34056us; p=4.6*f p=5.9*f2 p=0.9*p2
For 4000 mounts, f= 9291us f2= 7434us p= 40723us p2= 46479us; p=4.4*f p=5.5*f2 p=0.9*p2
For 5000 mounts, f=11423us f2= 9219us p= 50878us p2= 58857us; p=4.5*f p=5.5*f2 p=0.9*p2
For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us; p=4.4*f p=5.5*f2 p=0.9*p2
For 20000 mounts, f=45811us f2=37211us p=203640us p2=237377us; p=4.4*f p=5.5*f2 p=0.9*p2
For 30000 mounts, f=69703us f2=54800us p=306778us p2=357629us; p=4.4*f p=5.6*f2 p=0.9*p2

The number of mounts doesn't affect the relative results - not surprising
for the direct pathwalk-based approaches ("f" and "p"), since the pathwalk
part is the same in both cases, though in one fsinfo() does it and in the
other open(O_PATH) does.

As you can see, your procfs-based approach takes consistently about 4.4 times
as long as fsinfo(QUERY_PATH) and 5.5 times as long as fsinfo(QUERY_MOUNT).

Going through mountfs ("p2") is even slower than going through procfs,
though it really ought to be comparable to fsinfo-by-mount-ID ("f2");
instead the latter is something like 6.5x faster.

I suspect the procfs-based and mountfs-based approaches suffer from creating
lots of inodes, dentries and file structs as you access the files. This also
means that they use more live state memory - and I think it lingers - if you
start using them, whereas fsinfo() uses none at all, beyond whatever is used
by the pathwalk to find the object to query (if you go that route).

mountfs is also going to be worse if you want more than one value, if you
persist in putting one attribute in each file.

David
---
commit ed109ef4351d44a3e881e6518a207431113c17c0
Author: David Howells <[email protected]>
Date: Tue Mar 31 14:39:07 2020 +0100

Performance test Miklós's patch vs fsinfo

diff --git a/fs/Makefile b/fs/Makefile
index b6bf2424c7f7..ac0627176db1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
obj-$(CONFIG_EROFS_FS) += erofs/
obj-$(CONFIG_VBOXSF_FS) += vboxsf/
obj-$(CONFIG_ZONEFS_FS) += zonefs/
+obj-y += mountfs/
diff --git a/fs/mount.h b/fs/mount.h
index 063f41bc2e93..89b091fc482f 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -82,6 +82,7 @@ struct mount {
atomic_t mnt_subtree_notifications; /* Number of notifications in subtree */
struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
#endif
+ struct mountfs_entry *mnt_mountfs_entry;
} __randomize_layout;

#define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -177,3 +178,11 @@ static inline void notify_mount(struct mount *triggered,
{
}
#endif
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+
+void mountfs_create(struct mount *mnt);
+extern void mountfs_remove(struct mount *mnt);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
+
diff --git a/fs/mountfs/Makefile b/fs/mountfs/Makefile
new file mode 100644
index 000000000000..35a65e9a966f
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y += super.o
diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c
new file mode 100644
index 000000000000..82c01eb6154d
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_SPINLOCK(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+ struct kref kref;
+ struct mount *mnt;
+ struct rb_node node;
+ int id;
+};
+
+static const char *mountfs_attrs[] = {
+ "root", "mountpoint", "id", "parent", "options", "children",
+ "group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+ (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+ kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+ kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+ struct mount *mnt;
+ bool visible = false;
+
+ rcu_read_lock();
+ mnt = rcu_dereference(entry->mnt);
+ if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+ visible = true;
+ rcu_read_unlock();
+
+ return visible;
+}
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+ const char *name = sf->file->f_path.dentry->d_name.name;
+ struct mountfs_entry *entry = sf->private;
+ struct mount *mnt;
+ struct vfsmount *m;
+ struct super_block *sb;
+ struct path root;
+ int tmp, err = -ENODEV;
+
+ mnt_namespace_lock_read();
+
+ mnt = entry->mnt;
+ if (!mnt || !mnt->mnt_ns)
+ goto out;
+
+ err = 0;
+ m = &mnt->mnt;
+ sb = m->mnt_sb;
+
+ if (strcmp(name, "root") == 0) {
+ if (sb->s_op->show_path) {
+ err = sb->s_op->show_path(sf, m->mnt_root);
+ } else {
+ seq_dentry(sf, m->mnt_root, " \t\n\\");
+ }
+ seq_putc(sf, '\n');
+ } else if (strcmp(name, "mountpoint") == 0) {
+ struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+ get_fs_root(current->fs, &root);
+ err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+ if (err == SEQ_SKIP) {
+ seq_puts(sf, "(unreachable)");
+ err = 0;
+ }
+ seq_putc(sf, '\n');
+ path_put(&root);
+ } else if (strcmp(name, "id") == 0) {
+ seq_printf(sf, "%i\n", mnt->mnt_id);
+ } else if (strcmp(name, "parent") == 0) {
+ tmp = rcu_dereference(mnt->mnt_parent)->mnt_id;
+ seq_printf(sf, "%i\n", tmp);
+ } else if (strcmp(name, "options") == 0) {
+ int mnt_flags = READ_ONCE(m->mnt_flags);
+
+ seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+ seq_mnt_opts(sf, mnt_flags);
+ seq_putc(sf, '\n');
+ } else if (strcmp(name, "children") == 0) {
+ struct mount *child;
+ bool first = true;
+
+ list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+ if (!first)
+ seq_putc(sf, ',');
+ else
+ first = false;
+ seq_printf(sf, "%i", child->mnt_id);
+ }
+ if (!first)
+ seq_putc(sf, '\n');
+ } else if (strcmp(name, "group") == 0) {
+ if (IS_MNT_SHARED(mnt))
+ seq_printf(sf, "%i\n", mnt->mnt_group_id);
+ } else if (strcmp(name, "master") == 0) {
+ if (IS_MNT_SLAVE(mnt)) {
+ tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+ seq_printf(sf, "%i\n", tmp);
+ }
+ } else if (strcmp(name, "propagate_from") == 0) {
+ if (IS_MNT_SLAVE(mnt)) {
+ get_fs_root(current->fs, &root);
+ tmp = get_dominating_id(mnt, &root);
+ if (tmp)
+ seq_printf(sf, "%i\n", tmp);
+ }
+ } else {
+ WARN_ON(1);
+ err = -EIO;
+ }
+out:
+ mnt_namespace_unlock_read();
+
+ return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+ .open = mountfs_attr_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+ return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+ struct rb_node **link = &mountfs_entries.rb_node;
+
+ *parent = NULL;
+ while (*link) {
+ struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+ *parent = *link;
+ if (id < entry->id)
+ link = &entry->node.rb_left;
+ else if (id > entry->id)
+ link = &entry->node.rb_right;
+ else
+ break;
+ }
+ return link;
+}
+
+void mountfs_create(struct mount *mnt)
+{
+ struct mountfs_entry *entry;
+ struct rb_node **link, *parent;
+
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry) {
+ WARN(1, "failed to allocate mountfs entry");
+ return;
+ }
+ kref_init(&entry->kref);
+ entry->mnt = mnt;
+ entry->id = mnt->mnt_id;
+
+ spin_lock(&mountfs_lock);
+ link = mountfs_find_node(entry->id, &parent);
+ if (!WARN_ON(*link)) {
+ rb_link_node(&entry->node, parent, link);
+ rb_insert_color(&entry->node, &mountfs_entries);
+ mnt->mnt_mountfs_entry = entry;
+ } else {
+ kfree(entry);
+ }
+ spin_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+ struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+ if (!entry)
+ return;
+ spin_lock(&mountfs_lock);
+ entry->mnt = NULL;
+ rb_erase(&entry->node, &mountfs_entries);
+ spin_unlock(&mountfs_lock);
+
+ mountfs_entry_put(entry);
+
+ mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+ struct mountfs_entry *entry = NULL;
+ struct rb_node **link, *dummy;
+ unsigned long mnt_id;
+ char buf[32];
+ int ret;
+
+ ret = kstrtoul(name, 10, &mnt_id);
+ if (ret || mnt_id > INT_MAX)
+ return NULL;
+
+ snprintf(buf, sizeof(buf), "%lu", mnt_id);
+ if (strcmp(buf, name) != 0)
+ return NULL;
+
+ spin_lock(&mountfs_lock);
+ link = mountfs_find_node(mnt_id, &dummy);
+ if (*link) {
+ entry = mountfs_node_to_entry(*link);
+ if (!mountfs_entry_visible(entry))
+ entry = NULL;
+ else
+ kref_get(&entry->kref);
+ }
+ spin_unlock(&mountfs_lock);
+
+ return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+ struct mountfs_entry *entry,
+ int idx)
+{
+ struct inode *inode;
+
+ inode = new_inode(dentry->d_sb);
+ if (!inode) {
+ mountfs_entry_put(entry);
+ return ERR_PTR(-ENOMEM);
+ }
+ inode->i_private = entry;
+ inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+ mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+ return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ struct mountfs_entry *entry = dir->i_private;
+ int i = 0;
+
+ if (entry) {
+ for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+ if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+ break;
+ if (i == ARRAY_SIZE(mountfs_attrs))
+ return ERR_PTR(-ENOENT);
+ i++;
+ kref_get(&entry->kref);
+ } else {
+ entry = mountfs_get_entry(dentry->d_name.name);
+ if (!entry)
+ return ERR_PTR(-ENOENT);
+ }
+
+ return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+ /* root: valid */
+ if (!entry)
+ return 1;
+
+ /* removed: invalid */
+ if (!entry->mnt)
+ return 0;
+
+ /* attribute or visible in this namespace: valid */
+ if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+ return 1;
+
+ /* invisible in this namespace: valid but deny entry */
+ return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+ struct rb_node *node;
+ struct mountfs_entry *entry = file_inode(file)->i_private;
+ char name[32];
+ const char *s;
+ unsigned int len, pos, id;
+
+ if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+ return 0;
+
+ if (entry) {
+ while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+ s = mountfs_attrs[ctx->pos - 2];
+ if (!dir_emit(ctx, s, strlen(s),
+ MOUNTFS_INO(entry->id) + ctx->pos,
+ DT_REG))
+ break;
+ ctx->pos++;
+ }
+ return 0;
+ }
+
+ pos = ctx->pos - 2;
+ do {
+ spin_lock(&mountfs_lock);
+ mountfs_find_node(pos, &node);
+ pos = 1U + INT_MAX;
+ do {
+ if (!node) {
+ spin_unlock(&mountfs_lock);
+ goto out;
+ }
+ entry = mountfs_node_to_entry(node);
+ node = rb_next(node);
+ } while (!mountfs_entry_visible(entry));
+ if (node)
+ pos = mountfs_node_to_entry(node)->id;
+ id = entry->id;
+ spin_unlock(&mountfs_lock);
+
+ len = snprintf(name, sizeof(name), "%i", id);
+ ctx->pos = id + 2;
+ if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR))
+ return 0;
+ } while (pos <= INT_MAX);
+out:
+ ctx->pos = pos + 2;
+ return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+ char name[32];
+ struct qstr this = { .name = name };
+ struct mount *mnt = real_mount(m);
+ struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+ struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+ this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+ dentry = d_hash_and_lookup(root, &this);
+ if (dentry && dentry->d_inode->i_private != entry) {
+ d_invalidate(dentry);
+ dput(dentry);
+ dentry = NULL;
+ }
+ if (!dentry) {
+ dentry = d_alloc(root, &this);
+ if (!dentry)
+ return -ENOMEM;
+
+ kref_get(&entry->kref);
+ old = mountfs_lookup_entry(dentry, entry, 0);
+ if (old) {
+ dput(dentry);
+ if (IS_ERR(old))
+ return PTR_ERR(old);
+ dentry = old;
+ }
+ }
+
+ *path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+ return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+ .d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+ .lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+ .iterate_shared = mountfs_readdir,
+ .read = generic_read_dir,
+ .llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+ inode->i_mode = mode;
+ inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+ if (S_ISREG(mode)) {
+ inode->i_size = PAGE_SIZE;
+ inode->i_fop = &mountfs_attr_fops;
+ } else {
+ inode->i_op = &mountfs_iops;
+ inode->i_fop = &mountfs_fops;
+ }
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+ struct mountfs_entry *entry = inode->i_private;
+
+ clear_inode(inode);
+ if (entry)
+ mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+ .statfs = simple_statfs,
+ .drop_inode = generic_delete_inode,
+ .evict_inode = mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *root;
+
+ sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+ sb->s_blocksize = PAGE_SIZE;
+ sb->s_blocksize_bits = PAGE_SHIFT;
+ sb->s_magic = MOUNTFS_SUPER_MAGIC;
+ sb->s_time_gran = 1;
+ sb->s_shrink.seeks = 0;
+ sb->s_op = &mountfs_sops;
+ sb->s_d_op = &mountfs_dops;
+
+ root = new_inode(sb);
+ if (!root)
+ return -ENOMEM;
+
+ root->i_ino = 1;
+ mountfs_init_inode(root, S_IFDIR | 0444);
+
+ sb->s_root = d_make_root(root);
+ if (!sb->s_root)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+ return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+ .get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+ fc->ops = &mountfs_context_ops;
+ fc->global = true;
+ return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+ .name = "mountfs",
+ .init_fs_context = mountfs_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+ int err;
+
+ err = register_filesystem(&mountfs_fs_type);
+ if (!err) {
+ mountfs_mnt = kern_mount(&mountfs_fs_type);
+ if (IS_ERR(mountfs_mnt)) {
+ err = PTR_ERR(mountfs_mnt);
+ unregister_filesystem(&mountfs_fs_type);
+ }
+ }
+ return err;
+}
+fs_initcall(mountfs_init);
diff --git a/fs/namespace.c b/fs/namespace.c
index 5427e732c1bf..a05a2885a90e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -962,6 +962,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)

if (fc->sb_flags & SB_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
+ else
+ mountfs_create(mnt);

atomic_inc(&fc->root->d_sb->s_active);
mnt->mnt.mnt_sb = fc->root->d_sb;
@@ -1033,7 +1035,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
}
EXPORT_SYMBOL_GPL(vfs_submount);

-static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+static struct mount *clone_mnt_common(struct mount *old, struct dentry *root,
int flag)
{
struct super_block *sb = old->mnt.mnt_sb;
@@ -1100,6 +1102,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
return ERR_PTR(err);
}

+static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+ int flag)
+{
+ struct mount *mnt = clone_mnt_common(old, root, flag);
+
+ if (!IS_ERR(mnt))
+ mountfs_create(mnt);
+
+ return mnt;
+}
+
static void cleanup_mnt(struct mount *mnt)
{
struct hlist_node *p;
@@ -1112,6 +1125,7 @@ static void cleanup_mnt(struct mount *mnt)
* so mnt_get_writers() below is safe.
*/
WARN_ON(mnt_get_writers(mnt));
+
if (unlikely(mnt->mnt_pins.first))
mnt_pin_kill(mnt);
hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1197,6 +1211,8 @@ static void mntput_no_expire(struct mount *mnt)
unlock_mount_hash();
shrink_dentry_list(&list);

+ mountfs_remove(mnt);
+
if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
struct task_struct *task = current;
if (likely(!(task->flags & PF_KTHREAD))) {
@@ -1263,13 +1279,14 @@ EXPORT_SYMBOL(path_is_mountpoint);
struct vfsmount *mnt_clone_internal(const struct path *path)
{
struct mount *p;
- p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE);
+ p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE);
if (IS_ERR(p))
return ERR_CAST(p);
p->mnt.mnt_flags |= MNT_INTERNAL;
return &p->mnt;
}

+
#ifdef CONFIG_PROC_FS
/* iterator; we want it to have access to namespace_sem, thus here... */
static void *m_start(struct seq_file *m, loff_t *pos)
@@ -1411,6 +1428,16 @@ static inline void namespace_lock(void)
down_write(&namespace_sem);
}

+void mnt_namespace_lock_read(void)
+{
+ down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+ up_read(&namespace_sem);
+}
+
enum umount_tree_flags {
UMOUNT_SYNC = 1,
UMOUNT_PROPAGATE = 2,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c7c64272b0fa..0477f8b51182 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_stuff[] = {
DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+ DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
#ifdef CONFIG_NET
DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations proc_tid_comm_inode_operations = {
static const struct pid_entry tid_base_stuff[] = {
DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+ DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
#ifdef CONFIG_NET
DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..94a57e178801 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo_operations = {
.iterate_shared = proc_readfdinfo,
.llseek = generic_file_llseek,
};
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+ struct files_struct *files = NULL;
+ struct task_struct *task;
+ struct path fd_path;
+ int ret = -ENOENT;
+
+ task = get_proc_task(d_inode(dentry));
+ if (task) {
+ files = get_files_struct(task);
+ put_task_struct(task);
+ }
+
+ if (files) {
+ unsigned int fd = proc_fd(d_inode(dentry));
+ struct file *fd_file;
+
+ spin_lock(&files->file_lock);
+ fd_file = fcheck_files(files, fd);
+ if (fd_file) {
+ fd_path = fd_file->f_path;
+ path_get(&fd_path);
+ ret = 0;
+ }
+ spin_unlock(&files->file_lock);
+ put_files_struct(files);
+ }
+ if (!ret) {
+ ret = mountfs_lookup_internal(fd_path.mnt, path);
+ path_put(&fd_path);
+ }
+
+ return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+ struct task_struct *task, const void *ptr)
+{
+ const struct fd_data *data = ptr;
+ struct proc_inode *ei;
+ struct inode *inode;
+
+ inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+
+ ei = PROC_I(inode);
+ ei->fd = data->fd;
+
+ inode->i_op = &proc_pid_link_inode_operations;
+ inode->i_size = 64;
+
+ ei->op.proc_get_link = proc_fdmount_link;
+ tid_fd_update_inode(task, inode, 0);
+
+ d_set_d_op(dentry, &tid_fd_dentry_operations);
+ return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+ return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+ return proc_readfd_common(file, ctx,
+ proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+ .lookup = proc_lookupfdmount,
+ .setattr = proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+ .read = generic_read_dir,
+ .iterate_shared = proc_readfdmount,
+ .llseek = generic_file_llseek,
+};
diff --git a/fs/proc/fd.h b/fs/proc/fd.h
index f371a602bf58..9e087c833e65 100644
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations proc_fd_inode_operations;
extern const struct file_operations proc_fdinfo_operations;
extern const struct inode_operations proc_fdinfo_inode_operations;

+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
extern int proc_fd_permission(struct inode *inode, int mask);

static inline unsigned int proc_fd(struct inode *inode)
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..e634faa9160e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
return security_sb_show_options(m, sb);
}

-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
- static const struct proc_fs_info mnt_info[] = {
- { MNT_NOSUID, ",nosuid" },
- { MNT_NODEV, ",nodev" },
- { MNT_NOEXEC, ",noexec" },
- { MNT_NOATIME, ",noatime" },
- { MNT_NODIRATIME, ",nodiratime" },
- { MNT_RELATIME, ",relatime" },
- { 0, NULL }
- };
- const struct proc_fs_info *fs_infop;
-
- for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
- if (mnt->mnt_flags & fs_infop->flag)
- seq_puts(m, fs_infop->str);
- }
-}

static inline void mangle(struct seq_file *m, const char *s)
{
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
err = show_sb_opts(m, sb);
if (err)
goto out;
- show_mnt_opts(m, mnt);
+ seq_mnt_opts(m, mnt->mnt_flags);
if (sb->s_op->show_options)
err = sb->s_op->show_options(m, mnt_path.dentry);
seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
goto out;

seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
- show_mnt_opts(m, mnt);
+ seq_mnt_opts(m, mnt->mnt_flags);

/* Tagged fields ("foo:X" or "bar") */
if (IS_MNT_SHARED(r))
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 1600034a929b..9726baba1732 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
#include <linux/cred.h>
#include <linux/mm.h>
#include <linux/printk.h>
+#include <linux/mount.h>
#include <linux/string_helpers.h>

#include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc)
}
EXPORT_SYMBOL(seq_dentry);

+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+ unsigned int i;
+ static const struct {
+ int flag;
+ const char *str;
+ } mnt_info[] = {
+ { MNT_NOSUID, ",nosuid" },
+ { MNT_NODEV, ",nodev" },
+ { MNT_NOEXEC, ",noexec" },
+ { MNT_NOATIME, ",noatime" },
+ { MNT_NODIRATIME, ",nodiratime" },
+ { MNT_RELATIME, ",relatime" },
+ { 0, NULL }
+ };
+
+ for (i = 0; mnt_info[i].flag; i++) {
+ if (mnt_flags & mnt_info[i].flag)
+ seq_puts(m, mnt_info[i].str);
+ }
+}
+
static void *single_start(struct seq_file *p, loff_t *pos)
{
return NULL + (*pos == 0);
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 770c2bf3aa43..9dd7812eb777 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, struct file *, const char *);
int seq_dentry(struct seq_file *, struct dentry *, const char *);
int seq_path_root(struct seq_file *m, const struct path *path,
const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);

int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 19be60ab950e..78deb8483d27 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -4,6 +4,7 @@
hostprogs := \
test-fsinfo \
test-fsmount \
+ test-fsinfo-perf \
test-mntinfo \
test-statx

@@ -12,6 +13,7 @@ always-y := $(hostprogs)
HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
HOSTLDLIBS_test-fsinfo += -static -lm
HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include
+HOSTCFLAGS_test-fsinfo-perf.o += -I$(objtree)/usr/include

HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include

2020-03-31 19:43:54

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 7:31 PM David Howells <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> wrote:
>
> > The basic problem in my view, is that the performance requirement of a
> > "get filesystem information" type of API just does not warrant a
> > binary coded interface. I've said this a number of times, but it fell
> > on deaf ears.
>
> It hasn't so fallen, but I don't necessarily agree with you. Let's pin some
> numbers on this.

Cool, thanks for testing. Unfortunately the test-fsinfo-perf.c file
didn't make it into the patch. Can you please refresh and resend?

> Okay, the results:
>
> For 1000 mounts, f= 1514us f2= 1102us p= 6014us p2= 6935us; p=4.0*f p=5.5*f2 p=0.9*p2
> For 2000 mounts, f= 4712us f2= 3675us p= 20937us p2= 22878us; p=4.4*f p=5.7*f2 p=0.9*p2
> For 3000 mounts, f= 6795us f2= 5304us p= 31080us p2= 34056us; p=4.6*f p=5.9*f2 p=0.9*p2
> For 4000 mounts, f= 9291us f2= 7434us p= 40723us p2= 46479us; p=4.4*f p=5.5*f2 p=0.9*p2
> For 5000 mounts, f=11423us f2= 9219us p= 50878us p2= 58857us; p=4.5*f p=5.5*f2 p=0.9*p2
> For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us; p=4.4*f p=5.5*f2 p=0.9*p2
> For 20000 mounts, f=45811us f2=37211us p=203640us p2=237377us; p=4.4*f p=5.5*f2 p=0.9*p2
> For 30000 mounts, f=69703us f2=54800us p=306778us p2=357629us; p=4.4*f p=5.6*f2 p=0.9*p2

So even the p2 method will give at least 80k queries/s, which is quite
good, considering that the need to rescan the complete mount tree
should be exceedingly rare (and in case it mattered, could be
optimized by priming from /proc/self/mountinfo).
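
(As a rough cross-check of where that figure comes from: the p2 column above
scans 10000 mounts in ~117273us, i.e. about 85k lookups/s, and 30000 mounts in
~357629us, i.e. about 84k lookups/s - so ~80k queries/s is a conservative
reading of those numbers.)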

Thanks,
Miklos

2020-03-31 19:48:39

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> Cool, thanks for testing. Unfortunately the test-fsinfo-perf.c file
> didn't make it into the patch. Can you please refresh and resend?

Oops - I forgot to add it. See attached.

David
---
commit b7239021cb7660bf328bb7fcce05e3a35ce5842b
Author: David Howells <[email protected]>
Date: Tue Mar 31 14:39:07 2020 +0100

Performance test Miklós's patch vs fsinfo

diff --git a/fs/Makefile b/fs/Makefile
index b6bf2424c7f7..ac0627176db1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -137,3 +137,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
obj-$(CONFIG_EROFS_FS) += erofs/
obj-$(CONFIG_VBOXSF_FS) += vboxsf/
obj-$(CONFIG_ZONEFS_FS) += zonefs/
+obj-y += mountfs/
diff --git a/fs/mount.h b/fs/mount.h
index 063f41bc2e93..89b091fc482f 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -82,6 +82,7 @@ struct mount {
atomic_t mnt_subtree_notifications; /* Number of notifications in subtree */
struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
#endif
+ struct mountfs_entry *mnt_mountfs_entry;
} __randomize_layout;

#define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -177,3 +178,11 @@ static inline void notify_mount(struct mount *triggered,
{
}
#endif
+
+void mnt_namespace_lock_read(void);
+void mnt_namespace_unlock_read(void);
+
+void mountfs_create(struct mount *mnt);
+extern void mountfs_remove(struct mount *mnt);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
+
diff --git a/fs/mountfs/Makefile b/fs/mountfs/Makefile
new file mode 100644
index 000000000000..35a65e9a966f
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y += super.o
diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c
new file mode 100644
index 000000000000..82c01eb6154d
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,502 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_SPINLOCK(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+ struct kref kref;
+ struct mount *mnt;
+ struct rb_node node;
+ int id;
+};
+
+static const char *mountfs_attrs[] = {
+ "root", "mountpoint", "id", "parent", "options", "children",
+ "group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+ (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+ kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+ kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+ struct mount *mnt;
+ bool visible = false;
+
+ rcu_read_lock();
+ mnt = rcu_dereference(entry->mnt);
+ if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+ visible = true;
+ rcu_read_unlock();
+
+ return visible;
+}
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+ const char *name = sf->file->f_path.dentry->d_name.name;
+ struct mountfs_entry *entry = sf->private;
+ struct mount *mnt;
+ struct vfsmount *m;
+ struct super_block *sb;
+ struct path root;
+ int tmp, err = -ENODEV;
+
+ mnt_namespace_lock_read();
+
+ mnt = entry->mnt;
+ if (!mnt || !mnt->mnt_ns)
+ goto out;
+
+ err = 0;
+ m = &mnt->mnt;
+ sb = m->mnt_sb;
+
+ if (strcmp(name, "root") == 0) {
+ if (sb->s_op->show_path) {
+ err = sb->s_op->show_path(sf, m->mnt_root);
+ } else {
+ seq_dentry(sf, m->mnt_root, " \t\n\\");
+ }
+ seq_putc(sf, '\n');
+ } else if (strcmp(name, "mountpoint") == 0) {
+ struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+ get_fs_root(current->fs, &root);
+ err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+ if (err == SEQ_SKIP) {
+ seq_puts(sf, "(unreachable)");
+ err = 0;
+ }
+ seq_putc(sf, '\n');
+ path_put(&root);
+ } else if (strcmp(name, "id") == 0) {
+ seq_printf(sf, "%i\n", mnt->mnt_id);
+ } else if (strcmp(name, "parent") == 0) {
+ tmp = rcu_dereference(mnt->mnt_parent)->mnt_id;
+ seq_printf(sf, "%i\n", tmp);
+ } else if (strcmp(name, "options") == 0) {
+ int mnt_flags = READ_ONCE(m->mnt_flags);
+
+ seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+ seq_mnt_opts(sf, mnt_flags);
+ seq_putc(sf, '\n');
+ } else if (strcmp(name, "children") == 0) {
+ struct mount *child;
+ bool first = true;
+
+ list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+ if (!first)
+ seq_putc(sf, ',');
+ else
+ first = false;
+ seq_printf(sf, "%i", child->mnt_id);
+ }
+ if (!first)
+ seq_putc(sf, '\n');
+ } else if (strcmp(name, "group") == 0) {
+ if (IS_MNT_SHARED(mnt))
+ seq_printf(sf, "%i\n", mnt->mnt_group_id);
+ } else if (strcmp(name, "master") == 0) {
+ if (IS_MNT_SLAVE(mnt)) {
+ tmp = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+ seq_printf(sf, "%i\n", tmp);
+ }
+ } else if (strcmp(name, "propagate_from") == 0) {
+ if (IS_MNT_SLAVE(mnt)) {
+ get_fs_root(current->fs, &root);
+ tmp = get_dominating_id(mnt, &root);
+ if (tmp)
+ seq_printf(sf, "%i\n", tmp);
+ }
+ } else {
+ WARN_ON(1);
+ err = -EIO;
+ }
+out:
+ mnt_namespace_unlock_read();
+
+ return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+ .open = mountfs_attr_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+ return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+ struct rb_node **link = &mountfs_entries.rb_node;
+
+ *parent = NULL;
+ while (*link) {
+ struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+ *parent = *link;
+ if (id < entry->id)
+ link = &entry->node.rb_left;
+ else if (id > entry->id)
+ link = &entry->node.rb_right;
+ else
+ break;
+ }
+ return link;
+}
+
+void mountfs_create(struct mount *mnt)
+{
+ struct mountfs_entry *entry;
+ struct rb_node **link, *parent;
+
+ entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry) {
+ WARN(1, "failed to allocate mountfs entry");
+ return;
+ }
+ kref_init(&entry->kref);
+ entry->mnt = mnt;
+ entry->id = mnt->mnt_id;
+
+ spin_lock(&mountfs_lock);
+ link = mountfs_find_node(entry->id, &parent);
+ if (!WARN_ON(*link)) {
+ rb_link_node(&entry->node, parent, link);
+ rb_insert_color(&entry->node, &mountfs_entries);
+ mnt->mnt_mountfs_entry = entry;
+ } else {
+ kfree(entry);
+ }
+ spin_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+ struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+ if (!entry)
+ return;
+ spin_lock(&mountfs_lock);
+ entry->mnt = NULL;
+ rb_erase(&entry->node, &mountfs_entries);
+ spin_unlock(&mountfs_lock);
+
+ mountfs_entry_put(entry);
+
+ mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+ struct mountfs_entry *entry = NULL;
+ struct rb_node **link, *dummy;
+ unsigned long mnt_id;
+ char buf[32];
+ int ret;
+
+ ret = kstrtoul(name, 10, &mnt_id);
+ if (ret || mnt_id > INT_MAX)
+ return NULL;
+
+ snprintf(buf, sizeof(buf), "%lu", mnt_id);
+ if (strcmp(buf, name) != 0)
+ return NULL;
+
+ spin_lock(&mountfs_lock);
+ link = mountfs_find_node(mnt_id, &dummy);
+ if (*link) {
+ entry = mountfs_node_to_entry(*link);
+ if (!mountfs_entry_visible(entry))
+ entry = NULL;
+ else
+ kref_get(&entry->kref);
+ }
+ spin_unlock(&mountfs_lock);
+
+ return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+ struct mountfs_entry *entry,
+ int idx)
+{
+ struct inode *inode;
+
+ inode = new_inode(dentry->d_sb);
+ if (!inode) {
+ mountfs_entry_put(entry);
+ return ERR_PTR(-ENOMEM);
+ }
+ inode->i_private = entry;
+ inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+ mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+ return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ struct mountfs_entry *entry = dir->i_private;
+ int i = 0;
+
+ if (entry) {
+ for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+ if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+ break;
+ if (i == ARRAY_SIZE(mountfs_attrs))
+ return ERR_PTR(-ENOENT);
+ i++;
+ kref_get(&entry->kref);
+ } else {
+ entry = mountfs_get_entry(dentry->d_name.name);
+ if (!entry)
+ return ERR_PTR(-ENOENT);
+ }
+
+ return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+ /* root: valid */
+ if (!entry)
+ return 1;
+
+ /* removed: invalid */
+ if (!entry->mnt)
+ return 0;
+
+ /* attribute or visible in this namespace: valid */
+ if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+ return 1;
+
+ /* invisible in this namespace: valid but deny entry */
+ return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+ struct rb_node *node;
+ struct mountfs_entry *entry = file_inode(file)->i_private;
+ char name[32];
+ const char *s;
+ unsigned int len, pos, id;
+
+ if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+ return 0;
+
+ if (entry) {
+ while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+ s = mountfs_attrs[ctx->pos - 2];
+ if (!dir_emit(ctx, s, strlen(s),
+ MOUNTFS_INO(entry->id) + ctx->pos,
+ DT_REG))
+ break;
+ ctx->pos++;
+ }
+ return 0;
+ }
+
+ pos = ctx->pos - 2;
+ do {
+ spin_lock(&mountfs_lock);
+ mountfs_find_node(pos, &node);
+ pos = 1U + INT_MAX;
+ do {
+ if (!node) {
+ spin_unlock(&mountfs_lock);
+ goto out;
+ }
+ entry = mountfs_node_to_entry(node);
+ node = rb_next(node);
+ } while (!mountfs_entry_visible(entry));
+ if (node)
+ pos = mountfs_node_to_entry(node)->id;
+ id = entry->id;
+ spin_unlock(&mountfs_lock);
+
+ len = snprintf(name, sizeof(name), "%i", id);
+ ctx->pos = id + 2;
+ if (!dir_emit(ctx, name, len, MOUNTFS_INO(id), DT_DIR))
+ return 0;
+ } while (pos <= INT_MAX);
+out:
+ ctx->pos = pos + 2;
+ return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+ char name[32];
+ struct qstr this = { .name = name };
+ struct mount *mnt = real_mount(m);
+ struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+ struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+ this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+ dentry = d_hash_and_lookup(root, &this);
+ if (dentry && dentry->d_inode->i_private != entry) {
+ d_invalidate(dentry);
+ dput(dentry);
+ dentry = NULL;
+ }
+ if (!dentry) {
+ dentry = d_alloc(root, &this);
+ if (!dentry)
+ return -ENOMEM;
+
+ kref_get(&entry->kref);
+ old = mountfs_lookup_entry(dentry, entry, 0);
+ if (old) {
+ dput(dentry);
+ if (IS_ERR(old))
+ return PTR_ERR(old);
+ dentry = old;
+ }
+ }
+
+ *path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+ return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+ .d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+ .lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+ .iterate_shared = mountfs_readdir,
+ .read = generic_read_dir,
+ .llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+ inode->i_mode = mode;
+ inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+ if (S_ISREG(mode)) {
+ inode->i_size = PAGE_SIZE;
+ inode->i_fop = &mountfs_attr_fops;
+ } else {
+ inode->i_op = &mountfs_iops;
+ inode->i_fop = &mountfs_fops;
+ }
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+ struct mountfs_entry *entry = inode->i_private;
+
+ clear_inode(inode);
+ if (entry)
+ mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+ .statfs = simple_statfs,
+ .drop_inode = generic_delete_inode,
+ .evict_inode = mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ struct inode *root;
+
+ sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+ sb->s_blocksize = PAGE_SIZE;
+ sb->s_blocksize_bits = PAGE_SHIFT;
+ sb->s_magic = MOUNTFS_SUPER_MAGIC;
+ sb->s_time_gran = 1;
+ sb->s_shrink.seeks = 0;
+ sb->s_op = &mountfs_sops;
+ sb->s_d_op = &mountfs_dops;
+
+ root = new_inode(sb);
+ if (!root)
+ return -ENOMEM;
+
+ root->i_ino = 1;
+ mountfs_init_inode(root, S_IFDIR | 0444);
+
+ sb->s_root = d_make_root(root);
+ if (!sb->s_root)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+ return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+ .get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+ fc->ops = &mountfs_context_ops;
+ fc->global = true;
+ return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+ .name = "mountfs",
+ .init_fs_context = mountfs_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+ int err;
+
+ err = register_filesystem(&mountfs_fs_type);
+ if (!err) {
+ mountfs_mnt = kern_mount(&mountfs_fs_type);
+ if (IS_ERR(mountfs_mnt)) {
+ err = PTR_ERR(mountfs_mnt);
+ unregister_filesystem(&mountfs_fs_type);
+ }
+ }
+ return err;
+}
+fs_initcall(mountfs_init);
diff --git a/fs/namespace.c b/fs/namespace.c
index 5427e732c1bf..a05a2885a90e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -962,6 +962,8 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)

if (fc->sb_flags & SB_KERNMOUNT)
mnt->mnt.mnt_flags = MNT_INTERNAL;
+ else
+ mountfs_create(mnt);

atomic_inc(&fc->root->d_sb->s_active);
mnt->mnt.mnt_sb = fc->root->d_sb;
@@ -1033,7 +1035,7 @@ vfs_submount(const struct dentry *mountpoint, struct file_system_type *type,
}
EXPORT_SYMBOL_GPL(vfs_submount);

-static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+static struct mount *clone_mnt_common(struct mount *old, struct dentry *root,
int flag)
{
struct super_block *sb = old->mnt.mnt_sb;
@@ -1100,6 +1102,17 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
return ERR_PTR(err);
}

+static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+ int flag)
+{
+ struct mount *mnt = clone_mnt_common(old, root, flag);
+
+ if (!IS_ERR(mnt))
+ mountfs_create(mnt);
+
+ return mnt;
+}
+
static void cleanup_mnt(struct mount *mnt)
{
struct hlist_node *p;
@@ -1112,6 +1125,7 @@ static void cleanup_mnt(struct mount *mnt)
* so mnt_get_writers() below is safe.
*/
WARN_ON(mnt_get_writers(mnt));
+
if (unlikely(mnt->mnt_pins.first))
mnt_pin_kill(mnt);
hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1197,6 +1211,8 @@ static void mntput_no_expire(struct mount *mnt)
unlock_mount_hash();
shrink_dentry_list(&list);

+ mountfs_remove(mnt);
+
if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
struct task_struct *task = current;
if (likely(!(task->flags & PF_KTHREAD))) {
@@ -1263,13 +1279,14 @@ EXPORT_SYMBOL(path_is_mountpoint);
struct vfsmount *mnt_clone_internal(const struct path *path)
{
struct mount *p;
- p = clone_mnt(real_mount(path->mnt), path->dentry, CL_PRIVATE);
+ p = clone_mnt_common(real_mount(path->mnt), path->dentry, CL_PRIVATE);
if (IS_ERR(p))
return ERR_CAST(p);
p->mnt.mnt_flags |= MNT_INTERNAL;
return &p->mnt;
}

+
#ifdef CONFIG_PROC_FS
/* iterator; we want it to have access to namespace_sem, thus here... */
static void *m_start(struct seq_file *m, loff_t *pos)
@@ -1411,6 +1428,16 @@ static inline void namespace_lock(void)
down_write(&namespace_sem);
}

+void mnt_namespace_lock_read(void)
+{
+ down_read(&namespace_sem);
+}
+
+void mnt_namespace_unlock_read(void)
+{
+ up_read(&namespace_sem);
+}
+
enum umount_tree_flags {
UMOUNT_SYNC = 1,
UMOUNT_PROPAGATE = 2,
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c7c64272b0fa..0477f8b51182 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_stuff[] = {
DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+ DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
#ifdef CONFIG_NET
DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations proc_tid_comm_inode_operations = {
static const struct pid_entry tid_base_stuff[] = {
DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+ DIR("fdmount", S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
#ifdef CONFIG_NET
DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..94a57e178801 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo_operations = {
.iterate_shared = proc_readfdinfo,
.llseek = generic_file_llseek,
};
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+ struct files_struct *files = NULL;
+ struct task_struct *task;
+ struct path fd_path;
+ int ret = -ENOENT;
+
+ task = get_proc_task(d_inode(dentry));
+ if (task) {
+ files = get_files_struct(task);
+ put_task_struct(task);
+ }
+
+ if (files) {
+ unsigned int fd = proc_fd(d_inode(dentry));
+ struct file *fd_file;
+
+ spin_lock(&files->file_lock);
+ fd_file = fcheck_files(files, fd);
+ if (fd_file) {
+ fd_path = fd_file->f_path;
+ path_get(&fd_path);
+ ret = 0;
+ }
+ spin_unlock(&files->file_lock);
+ put_files_struct(files);
+ }
+ if (!ret) {
+ ret = mountfs_lookup_internal(fd_path.mnt, path);
+ path_put(&fd_path);
+ }
+
+ return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+ struct task_struct *task, const void *ptr)
+{
+ const struct fd_data *data = ptr;
+ struct proc_inode *ei;
+ struct inode *inode;
+
+ inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+
+ ei = PROC_I(inode);
+ ei->fd = data->fd;
+
+ inode->i_op = &proc_pid_link_inode_operations;
+ inode->i_size = 64;
+
+ ei->op.proc_get_link = proc_fdmount_link;
+ tid_fd_update_inode(task, inode, 0);
+
+ d_set_d_op(dentry, &tid_fd_dentry_operations);
+ return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+ return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+ return proc_readfd_common(file, ctx,
+ proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+ .lookup = proc_lookupfdmount,
+ .setattr = proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+ .read = generic_read_dir,
+ .iterate_shared = proc_readfdmount,
+ .llseek = generic_file_llseek,
+};
diff --git a/fs/proc/fd.h b/fs/proc/fd.h
index f371a602bf58..9e087c833e65 100644
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations proc_fd_inode_operations;
extern const struct file_operations proc_fdinfo_operations;
extern const struct inode_operations proc_fdinfo_inode_operations;

+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
extern int proc_fd_permission(struct inode *inode, int mask);

static inline unsigned int proc_fd(struct inode *inode)
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..e634faa9160e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
return security_sb_show_options(m, sb);
}

-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
- static const struct proc_fs_info mnt_info[] = {
- { MNT_NOSUID, ",nosuid" },
- { MNT_NODEV, ",nodev" },
- { MNT_NOEXEC, ",noexec" },
- { MNT_NOATIME, ",noatime" },
- { MNT_NODIRATIME, ",nodiratime" },
- { MNT_RELATIME, ",relatime" },
- { 0, NULL }
- };
- const struct proc_fs_info *fs_infop;
-
- for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
- if (mnt->mnt_flags & fs_infop->flag)
- seq_puts(m, fs_infop->str);
- }
-}

static inline void mangle(struct seq_file *m, const char *s)
{
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
err = show_sb_opts(m, sb);
if (err)
goto out;
- show_mnt_opts(m, mnt);
+ seq_mnt_opts(m, mnt->mnt_flags);
if (sb->s_op->show_options)
err = sb->s_op->show_options(m, mnt_path.dentry);
seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
goto out;

seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
- show_mnt_opts(m, mnt);
+ seq_mnt_opts(m, mnt->mnt_flags);

/* Tagged fields ("foo:X" or "bar") */
if (IS_MNT_SHARED(r))
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 1600034a929b..9726baba1732 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
#include <linux/cred.h>
#include <linux/mm.h>
#include <linux/printk.h>
+#include <linux/mount.h>
#include <linux/string_helpers.h>

#include <linux/uaccess.h>
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struct dentry *dentry, const char *esc)
}
EXPORT_SYMBOL(seq_dentry);

+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+ unsigned int i;
+ static const struct {
+ int flag;
+ const char *str;
+ } mnt_info[] = {
+ { MNT_NOSUID, ",nosuid" },
+ { MNT_NODEV, ",nodev" },
+ { MNT_NOEXEC, ",noexec" },
+ { MNT_NOATIME, ",noatime" },
+ { MNT_NODIRATIME, ",nodiratime" },
+ { MNT_RELATIME, ",relatime" },
+ { 0, NULL }
+ };
+
+ for (i = 0; mnt_info[i].flag; i++) {
+ if (mnt_flags & mnt_info[i].flag)
+ seq_puts(m, mnt_info[i].str);
+ }
+}
+
static void *single_start(struct seq_file *p, loff_t *pos)
{
return NULL + (*pos == 0);
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 770c2bf3aa43..9dd7812eb777 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, struct file *, const char *);
int seq_dentry(struct seq_file *, struct dentry *, const char *);
int seq_path_root(struct seq_file *m, const struct path *path,
const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);

int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
int single_open_size(struct file *, int (*)(struct seq_file *, void *), void *, size_t);
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 19be60ab950e..78deb8483d27 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -4,6 +4,7 @@
hostprogs := \
test-fsinfo \
test-fsmount \
+ test-fsinfo-perf \
test-mntinfo \
test-statx

@@ -12,6 +13,7 @@ always-y := $(hostprogs)
HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
HOSTLDLIBS_test-fsinfo += -static -lm
HOSTCFLAGS_test-mntinfo.o += -I$(objtree)/usr/include
+HOSTCFLAGS_test-fsinfo-perf.o += -I$(objtree)/usr/include

HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
new file mode 100644
index 000000000000..fba40737f768
--- /dev/null
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -0,0 +1,361 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells ([email protected])
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <sys/mount.h>
+#include <sys/time.h>
+#include <linux/fsinfo.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+#define ERR(ret, what) do { if ((long)(ret) == -1) { perror(what); exit(1); } } while(0)
+#define OOM(ret) do { if (!(ret)) { perror(NULL); exit(1); } } while(0)
+
+static int nr_mounts = 3;
+static const char *base_path;
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename,
+ struct fsinfo_params *params, size_t params_size,
+ void *result_buffer, size_t result_buf_size)
+{
+ return syscall(__NR_fsinfo, dfd, filename,
+ params, params_size,
+ result_buffer, result_buf_size);
+}
+
+static void iterate(void (*func)(int i, const char *))
+{
+ char name[4096];
+ int i;
+
+ for (i = 0; i < nr_mounts; i++) {
+ sprintf(name, "%s/%d", base_path, i);
+ func(i, name);
+ }
+}
+
+static void make_mount(int ix, const char *path)
+{
+ ERR(mkdir(path, 0755), "mkdir");
+ ERR(mount("none", path, "tmpfs", 0, NULL), "mount");
+ ERR(mount("none", path, NULL, MS_PRIVATE, NULL), "mount");
+}
+
+static void do_umount(void)
+{
+ printf("--- umount ---\n");
+ if (umount2(base_path, MNT_DETACH) == -1)
+ perror("umount");
+}
+
+static unsigned long sum_mnt_id;
+
+static void get_mntid_by_fsinfo(int ix, const char *path)
+{
+ struct fsinfo_mount_info r;
+ struct fsinfo_params params = {
+ .flags = FSINFO_FLAGS_QUERY_PATH,
+ .request = FSINFO_ATTR_MOUNT_INFO,
+ };
+
+ ERR(fsinfo(AT_FDCWD, path, &params, sizeof(params), &r, sizeof(r)),
+ "fsinfo");
+ //printf("[%u] %u\n", ix, r.mnt_id);
+ sum_mnt_id += r.mnt_id;
+}
+
+static void get_mntid_by_proc(int ix, const char *path)
+{
+ unsigned int mnt_id;
+ ssize_t len;
+ char procfile[100], buffer[4096], *p, *nl;
+ int fd, fd2;
+
+ fd = open(path, O_PATH);
+ ERR(fd, "open/path");
+ sprintf(procfile, "/proc/self/fdinfo/%u", fd);
+ fd2 = open(procfile, O_RDONLY);
+ ERR(fd2, "open/proc");
+ len = read(fd2, buffer, sizeof(buffer) - 1);
+ ERR(len, "read");
+ buffer[len] = 0;
+ close(fd2);
+ close(fd);
+
+ p = buffer;
+ do {
+ nl = strchr(p, '\n');
+ if (nl)
+ *nl++ = '\0';
+ else
+ nl = NULL;
+
+ if (strncmp(p, "mnt_id:", 7) != 0)
+ continue;
+ p += 7;
+ while (isblank(*p))
+ p++;
+ /* Have to allow for extra numbers being added to the line */
+ if (sscanf(p, "%u", &mnt_id) != 1) {
+ fprintf(stderr, "Bad format %s\n", procfile);
+ exit(3);
+ }
+ break;
+
+ } while ((p = nl));
+
+ if (!p) {
+ fprintf(stderr, "Missing field %s\n", procfile);
+ exit(3);
+ }
+
+ sum_mnt_id += mnt_id;
+ //printf("[%u] %u\n", ix, mnt_id);
+}
+
+static void get_mntid_by_fsinfo_2(void)
+{
+ struct fsinfo_mount_child *children, *c, *end;
+ struct fsinfo_mount_info r;
+ struct fsinfo_params params = {
+ .flags = FSINFO_FLAGS_QUERY_PATH,
+ .request = FSINFO_ATTR_MOUNT_INFO,
+ };
+ unsigned int base_mnt_id;
+ size_t s_children, n_children;
+ char name[32];
+ int i;
+
+ /* Convert path to mount ID */
+ ERR(fsinfo(AT_FDCWD, base_path, &params, sizeof(params), &r, sizeof(r)),
+ "fsinfo/base");
+ base_mnt_id = r.mnt_id;
+ //printf("[B] %u\n", base_mnt_id);
+
+ /* Get a list of all the children of this mount ID */
+ s_children = (nr_mounts + 1) * sizeof(*children);
+ children = malloc(s_children);
+ OOM(children);
+
+ params.flags = FSINFO_FLAGS_QUERY_MOUNT;
+ params.request = FSINFO_ATTR_MOUNT_CHILDREN;
+ sprintf(name, "%u", base_mnt_id);
+ s_children = fsinfo(AT_FDCWD, name, &params, sizeof(params), children, s_children);
+ ERR(s_children, "fsinfo/children");
+
+ /* Query each child */
+ n_children = s_children / sizeof(*c) - 1; // Parent is added at end
+ c = children;
+ end = c + n_children;
+ for (i = 0; c < end; c++, i++) {
+ //printf("[%u] %u\n", i, c->mnt_id);
+ params.flags = FSINFO_FLAGS_QUERY_MOUNT;
+ params.request = FSINFO_ATTR_MOUNT_INFO;
+ sprintf(name, "%u", c->mnt_id);
+ ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &r, sizeof(r)),
+ "fsinfo/child");
+ sum_mnt_id += r.mnt_id;
+ }
+}
+
+static void get_mntid_by_mountfs(void)
+{
+ unsigned int base_mnt_id, mnt_id, x;
+ ssize_t len, s_children;
+ char procfile[100], buffer[100], *children, *p, *q, *nl, *comma;
+ int fd, fd2, mntfd, i;
+
+ /* Start off by reading the mount ID from the base path */
+ fd = open(base_path, O_PATH);
+ ERR(fd, "open/path");
+ sprintf(procfile, "/proc/self/fdinfo/%u", fd);
+ fd2 = open(procfile, O_RDONLY);
+ ERR(fd2, "open/proc");
+ len = read(fd2, buffer, sizeof(buffer) - 1);
+ ERR(len, "read");
+ buffer[len] = 0;
+ close(fd2);
+ close(fd);
+
+ p = buffer;
+ do {
+ nl = strchr(p, '\n');
+ if (nl)
+ *nl++ = '\0';
+ else
+ nl = NULL;
+
+ if (strncmp(p, "mnt_id:", 7) != 0)
+ continue;
+ p += 7;
+ while (isblank(*p))
+ p++;
+ /* Have to allow for extra numbers being added to the line */
+ if (sscanf(p, "%u", &base_mnt_id) != 1) {
+ fprintf(stderr, "Bad format %s\n", procfile);
+ exit(3);
+ }
+ break;
+
+ } while ((p = nl));
+
+ if (!p) {
+ fprintf(stderr, "Missing field %s\n", procfile);
+ exit(3);
+ }
+
+ if (0) printf("[B] %u\n", base_mnt_id);
+
+ mntfd = open("/mnt", O_PATH);
+ ERR(fd, "open/mountfs");
+
+ /* Get a list of all the children of this mount ID */
+ s_children = (nr_mounts) * 12;
+ children = malloc(s_children);
+ OOM(children);
+
+ sprintf(procfile, "%u/children", base_mnt_id);
+ fd = openat(mntfd, procfile, O_RDONLY);
+ ERR(fd, "open/children");
+ s_children = read(fd, children, s_children - 1);
+ ERR(s_children, "read/children");
+ close(fd);
+ if (s_children > 0 && children[s_children - 1] == '\n')
+ s_children--;
+ children[s_children] = 0;
+
+ /* Query each child */
+ p = children;
+ if (!*p)
+ return;
+ i = 0;
+ do {
+ mnt_id = strtoul(p, &comma, 10);
+ if (*comma) {
+ if (*comma != ',') {
+ fprintf(stderr, "Bad format in mountfs-children\n");
+ exit(3);
+ }
+ comma++;
+ }
+
+ sprintf(procfile, "%u/id", mnt_id);
+ fd = openat(mntfd, procfile, O_RDONLY);
+ ERR(fd, procfile);
+ len = read(fd, buffer, sizeof(buffer) - 1);
+ ERR(len, "read/id");
+ close(fd);
+ if (len > 0 && buffer[len - 1] == '\n')
+ len--;
+ buffer[len] = 0;
+
+ x = strtoul(buffer, &q, 10);
+
+ if (*q) {
+ fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+ exit(3);
+ }
+
+ if (0) printf("[%u] %u\n", i++, x);
+ sum_mnt_id += x;
+ } while (p = comma, *comma);
+}
+
+static unsigned long duration(struct timeval *before, struct timeval *after)
+{
+ unsigned long a, b;
+
+ a = after->tv_sec * 1000000 + after->tv_usec;
+ b = before->tv_sec * 1000000 + before->tv_usec;
+ return a - b;
+}
+
+int main(int argc, char **argv)
+{
+ struct timeval f_before, f_after;
+ struct timeval f2_before, f2_after;
+ struct timeval p_before, p_after;
+ struct timeval p2_before, p2_after;
+ const char *path;
+ unsigned long f_dur, f2_dur, p_dur, p2_dur;
+
+ if (argc < 2) {
+ fprintf(stderr, "Format: %s <path> [nr_mounts]\n", argv[0]);
+ exit(2);
+ }
+
+ if (argc == 3)
+ nr_mounts = atoi(argv[2]);
+
+ path = argv[1];
+ ERR(mount("none", path, "tmpfs", 0, NULL), "mount");
+ ERR(mount("none", path, NULL, MS_PRIVATE, NULL), "mount");
+ base_path = path;
+ atexit(do_umount);
+
+ printf("--- make mounts ---\n");
+ iterate(make_mount);
+
+ printf("--- test fsinfo by path ---\n");
+ sum_mnt_id = 0;
+ ERR(gettimeofday(&f_before, NULL), "gettimeofday");
+ iterate(get_mntid_by_fsinfo);
+ ERR(gettimeofday(&f_after, NULL), "gettimeofday");
+ printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+ printf("--- test fsinfo by mnt_id ---\n");
+ sum_mnt_id = 0;
+ ERR(gettimeofday(&f2_before, NULL), "gettimeofday");
+ get_mntid_by_fsinfo_2();
+ ERR(gettimeofday(&f2_after, NULL), "gettimeofday");
+ printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+ printf("--- test /proc/fdinfo ---\n");
+ sum_mnt_id = 0;
+ ERR(gettimeofday(&p_before, NULL), "gettimeofday");
+ iterate(get_mntid_by_proc);
+ ERR(gettimeofday(&p_after, NULL), "gettimeofday");
+ printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+ printf("--- test mountfs ---\n");
+ sum_mnt_id = 0;
+ ERR(gettimeofday(&p2_before, NULL), "gettimeofday");
+ get_mntid_by_mountfs();
+ ERR(gettimeofday(&p2_after, NULL), "gettimeofday");
+ printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+
+ f_dur = duration(&f_before, &f_after);
+ f2_dur = duration(&f2_before, &f2_after);
+ p_dur = duration(&p_before, &p_after);
+ p2_dur = duration(&p2_before, &p2_after);
+ //printf("fsinfo duration %10luus for %d mounts\n", f_dur, nr_mounts);
+ //printf("procfd duration %10luus for %d mounts\n", p_dur, nr_mounts);
+
+ printf("For %7d mounts, f=%10luus f2=%10luus p=%10luus p2=%10luus; p=%.1f*f p=%.1f*f2 p=%.1f*p2\n",
+ nr_mounts, f_dur, f2_dur, p_dur, p2_dur,
+ (double)p_dur / (double)f_dur,
+ (double)p_dur / (double)f2_dur,
+ (double)p_dur / (double)p2_dur);
+ return 0;
+}

2020-03-31 21:15:10

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> So even the p2 method will give at least 80k queries/s, which is quite
> good, considering that the need to rescan the complete mount tree
> should be exceedingly rare (and in case it mattered, could be
> optimized by priming from /proc/self/mountinfo).

One thing to note is that the test is actually a little biased in favour of
the "p" test, where the mnt_id is looked up by path from /proc/fdinfo. That's
not all that useful, except as an index into mountfs. I'm not sure how much
use it is as a check on whether the mount is the same mount or not, since
mount IDs can get reused.

If I instead use the parent_id all round as the desired target value, I then
see:

For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us <-- prev email
For 10000 mounts, f=24853us f2=20453us p=235581us p2= 59798us <-- parent_id

Some observations:

(1) fsinfo() gets a bit slower, reflecting the extra locking that must be
done to access the topology information (it's using a different
attribute).

(2) Going via /proc/fdinfo now includes a further access into mountfs - and
this makes the access ~2.3x slower than it was before.

(3) Going via mount ID directly into mountfs (the "p2" test) appears faster
than it did before, even though nothing in that path changed; it's still
slower than fsinfo. I ascribe this to the inode and dentry caching left
over from the "p" test.

The attached patch adjusts the test program.

David
---
commit e9844e27f3061e4ef2d1511786b5ea60338dc610
Author: David Howells <[email protected]>
Date: Tue Mar 31 21:14:58 2020 +0100

Get parent ID instead

diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
index fba40737f768..2bcde06ee78b 100644
--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -69,27 +69,27 @@ static void do_umount(void)
perror("umount");
}

-static unsigned long sum_mnt_id;
+static unsigned long sum_check, sum_check_2;

-static void get_mntid_by_fsinfo(int ix, const char *path)
+static void get_id_by_fsinfo(int ix, const char *path)
{
- struct fsinfo_mount_info r;
+ struct fsinfo_mount_topology r;
struct fsinfo_params params = {
.flags = FSINFO_FLAGS_QUERY_PATH,
- .request = FSINFO_ATTR_MOUNT_INFO,
+ .request = FSINFO_ATTR_MOUNT_TOPOLOGY,
};

ERR(fsinfo(AT_FDCWD, path, &params, sizeof(params), &r, sizeof(r)),
"fsinfo");
- //printf("[%u] %u\n", ix, r.mnt_id);
- sum_mnt_id += r.mnt_id;
+ sum_check += r.parent_id;
+ sum_check_2 += r.mnt_topology_changes;
}

-static void get_mntid_by_proc(int ix, const char *path)
+static void get_id_by_proc(int ix, const char *path)
{
- unsigned int mnt_id;
+ unsigned int mnt_id, x;
ssize_t len;
- char procfile[100], buffer[4096], *p, *nl;
+ char procfile[100], buffer[4096], *p, *q, *nl;
int fd, fd2;

fd = open(path, O_PATH);
@@ -130,12 +130,31 @@ static void get_mntid_by_proc(int ix, const char *path)
exit(3);
}

- sum_mnt_id += mnt_id;
- //printf("[%u] %u\n", ix, mnt_id);
+ /* Now look the ID up on mountfs */
+ sprintf(procfile, "/mnt/%u/parent", mnt_id);
+ fd = open(procfile, O_RDONLY);
+ ERR(fd, procfile);
+ len = read(fd, buffer, sizeof(buffer) - 1);
+ ERR(len, "read/parent");
+ close(fd);
+ if (len > 0 && buffer[len - 1] == '\n')
+ len--;
+ buffer[len] = 0;
+
+ x = strtoul(buffer, &q, 10);
+
+ if (*q) {
+ fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+ exit(3);
+ }
+
+ sum_check += x;
+ //printf("[%u] %u\n", ix, x);
}

-static void get_mntid_by_fsinfo_2(void)
+static void get_id_by_fsinfo_2(void)
{
+ struct fsinfo_mount_topology t;
struct fsinfo_mount_child *children, *c, *end;
struct fsinfo_mount_info r;
struct fsinfo_params params = {
@@ -171,15 +190,16 @@ static void get_mntid_by_fsinfo_2(void)
for (i = 0; c < end; c++, i++) {
//printf("[%u] %u\n", i, c->mnt_id);
params.flags = FSINFO_FLAGS_QUERY_MOUNT;
- params.request = FSINFO_ATTR_MOUNT_INFO;
+ params.request = FSINFO_ATTR_MOUNT_TOPOLOGY;
sprintf(name, "%u", c->mnt_id);
- ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &r, sizeof(r)),
+ ERR(fsinfo(AT_FDCWD, name, &params, sizeof(params), &t, sizeof(t)),
"fsinfo/child");
- sum_mnt_id += r.mnt_id;
+ sum_check += t.parent_id;
+ sum_check_2 += t.mnt_topology_changes;
}
}

-static void get_mntid_by_mountfs(void)
+static void get_id_by_mountfs(void)
{
unsigned int base_mnt_id, mnt_id, x;
ssize_t len, s_children;
@@ -260,11 +280,11 @@ static void get_mntid_by_mountfs(void)
comma++;
}

- sprintf(procfile, "%u/id", mnt_id);
+ sprintf(procfile, "%u/parent", mnt_id);
fd = openat(mntfd, procfile, O_RDONLY);
ERR(fd, procfile);
len = read(fd, buffer, sizeof(buffer) - 1);
- ERR(len, "read/id");
+ ERR(len, "read/parent");
close(fd);
if (len > 0 && buffer[len - 1] == '\n')
len--;
@@ -278,7 +298,7 @@ static void get_mntid_by_mountfs(void)
}

if (0) printf("[%u] %u\n", i++, x);
- sum_mnt_id += x;
+ sum_check += x;
} while (p = comma, *comma);
}

@@ -318,32 +338,32 @@ int main(int argc, char **argv)
iterate(make_mount);

printf("--- test fsinfo by path ---\n");
- sum_mnt_id = 0;
+ sum_check = 0;
ERR(gettimeofday(&f_before, NULL), "gettimeofday");
- iterate(get_mntid_by_fsinfo);
+ iterate(get_id_by_fsinfo);
ERR(gettimeofday(&f_after, NULL), "gettimeofday");
- printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+ printf("sum(mnt_id) = %lu\n", sum_check);

printf("--- test fsinfo by mnt_id ---\n");
- sum_mnt_id = 0;
+ sum_check = 0;
ERR(gettimeofday(&f2_before, NULL), "gettimeofday");
- get_mntid_by_fsinfo_2();
+ get_id_by_fsinfo_2();
ERR(gettimeofday(&f2_after, NULL), "gettimeofday");
- printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+ printf("sum(mnt_id) = %lu\n", sum_check);

printf("--- test /proc/fdinfo ---\n");
- sum_mnt_id = 0;
+ sum_check = 0;
ERR(gettimeofday(&p_before, NULL), "gettimeofday");
- iterate(get_mntid_by_proc);
+ iterate(get_id_by_proc);
ERR(gettimeofday(&p_after, NULL), "gettimeofday");
- printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+ printf("sum(mnt_id) = %lu\n", sum_check);

printf("--- test mountfs ---\n");
- sum_mnt_id = 0;
+ sum_check = 0;
ERR(gettimeofday(&p2_before, NULL), "gettimeofday");
- get_mntid_by_mountfs();
+ get_id_by_mountfs();
ERR(gettimeofday(&p2_after, NULL), "gettimeofday");
- printf("sum(mnt_id) = %lu\n", sum_mnt_id);
+ printf("sum(mnt_id) = %lu\n", sum_check);

f_dur = duration(&f_before, &f_after);
f2_dur = duration(&f2_before, &f2_after);

2020-03-31 21:24:36

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

David Howells <[email protected]> wrote:

> > So even the p2 method will give at least 80k queries/s, which is quite
> > good, considering that the need to rescan the complete mount tree
> > should be exceedingly rare (and in case it mattered, could be
> > optimized by priming from /proc/self/mountinfo).
>
> One thing to note is that the test is actually a little biased in favour of
> the "p" test, where the mnt_id is looked up by path from /proc/fdinfo. That's
> not all that useful, except as an index into mountfs. I'm not sure how much
> use it is as a check on whether the mount is the same mount or not, since
> mount IDs can get reused.

However, to deal with an overrun, you're going to have to read multiple
attributes. So I've added an attribute file to expose the topology change
counter and it now reads that as well.

For 10000 mounts, f=22899us f2=18240us p=101054us p2=117273us <-- prev email
For 10000 mounts, f=24853us f2=20453us p=235581us p2= 59798us <-- parent_id
For 10000 mounts, f=24621us f2=20528us p=320164us p2=111416us <-- counter

Probably unsurprisingly, this doesn't affect fsinfo() significantly since I've
tried to expose the change counters in relevant places. It does, however,
significantly affect mountfs because you seem to want every value to be
exposed through its own file.

Now this can be worked around by having files that bundle up several values
that are of interest to a particular operation (e.g. rescanning after a
notification queue overrun).
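
For illustration only (this is not in the attached patch), a bundled
attribute could look something like the following in mountfs_attr_show(),
using a hypothetical "summary" file (which would also need adding to
mountfs_attrs[]) that returns the mount ID, parent ID and topology counter in
a single read:

	} else if (strcmp(name, "summary") == 0) {
		/* Hypothetical bundled attribute: one open/read/close per
		 * mount instead of three when rescanning after an overrun.
		 */
		seq_printf(sf, "%i %i %u\n",
			   mnt->mnt_id,
			   rcu_dereference(mnt->mnt_parent)->mnt_id,
			   atomic_read(&mnt->mnt_topology_changes));
	}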

See the attached additional patch. Note that the

sum_check_2 += r.mnt_topology_changes;

bits in the fsinfo() tests accidentally got left in the preceding patch and so
aren't in this one.

David
---
commit 6c62787aec41f67c1d5a55a0d59578854bcef6f8
Author: David Howells <[email protected]>
Date: Tue Mar 31 21:53:11 2020 +0100

Add a mountfs file to export the topology counter

diff --git a/fs/mountfs/super.c b/fs/mountfs/super.c
index 82c01eb6154d..58c05feb4fdd 100644
--- a/fs/mountfs/super.c
+++ b/fs/mountfs/super.c
@@ -22,7 +22,7 @@ struct mountfs_entry {

static const char *mountfs_attrs[] = {
"root", "mountpoint", "id", "parent", "options", "children",
- "group", "master", "propagate_from"
+ "group", "master", "propagate_from", "counter"
};

#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
@@ -128,6 +128,8 @@ static int mountfs_attr_show(struct seq_file *sf, void *v)
if (tmp)
seq_printf(sf, "%i\n", tmp);
}
+ } else if (strcmp(name, "counter") == 0) {
+ seq_printf(sf, "%u\n", atomic_read(&mnt->mnt_topology_changes));
} else {
WARN_ON(1);
err = -EIO;
diff --git a/samples/vfs/test-fsinfo-perf.c b/samples/vfs/test-fsinfo-perf.c
index 2bcde06ee78b..2b7606a53c2d 100644
--- a/samples/vfs/test-fsinfo-perf.c
+++ b/samples/vfs/test-fsinfo-perf.c
@@ -149,6 +149,26 @@ static void get_id_by_proc(int ix, const char *path)
}

sum_check += x;
+
+ /* And now the topology change counter */
+ sprintf(procfile, "/mnt/%u/counter", mnt_id);
+ fd = open(procfile, O_RDONLY);
+ ERR(fd, procfile);
+ len = read(fd, buffer, sizeof(buffer) - 1);
+ ERR(len, "read/counter");
+ close(fd);
+ if (len > 0 && buffer[len - 1] == '\n')
+ len--;
+ buffer[len] = 0;
+
+ x = strtoul(buffer, &q, 10);
+
+ if (*q) {
+ fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+ exit(3);
+ }
+
+ sum_check_2 += x;
//printf("[%u] %u\n", ix, x);
}

@@ -204,7 +224,7 @@ static void get_id_by_mountfs(void)
unsigned int base_mnt_id, mnt_id, x;
ssize_t len, s_children;
char procfile[100], buffer[100], *children, *p, *q, *nl, *comma;
- int fd, fd2, mntfd, i;
+ int fd, fd2, mntfd;

/* Start off by reading the mount ID from the base path */
fd = open(base_path, O_PATH);
@@ -269,7 +289,6 @@ static void get_id_by_mountfs(void)
p = children;
if (!*p)
return;
- i = 0;
do {
mnt_id = strtoul(p, &comma, 10);
if (*comma) {
@@ -297,8 +316,26 @@ static void get_id_by_mountfs(void)
exit(3);
}

- if (0) printf("[%u] %u\n", i++, x);
sum_check += x;
+
+ sprintf(procfile, "%u/counter", mnt_id);
+ fd = openat(mntfd, procfile, O_RDONLY);
+ ERR(fd, procfile);
+ len = read(fd, buffer, sizeof(buffer) - 1);
+ ERR(len, "read/counter");
+ close(fd);
+ if (len > 0 && buffer[len - 1] == '\n')
+ len--;
+ buffer[len] = 0;
+
+ x = strtoul(buffer, &q, 10);
+
+ if (*q) {
+ fprintf(stderr, "Bad format in %s '%s'\n", procfile, buffer);
+ exit(3);
+ }
+
+ sum_check_2 += x;
} while (p = comma, *comma);
}


2020-03-31 21:53:39

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Christian Brauner <[email protected]> wrote:

> querying all properties of a mount atomically all-at-once,

I don't actually offer that, per se.

Having an atomic all-at-once query for a single mount is actually quite a
burden on the system. There's potentially a lot of state involved, much of
which you don't necessarily need.

I've tried to avoid the need to do that by adding change counters that can be
queried cheaply. You read the counters, then you check mounts and superblocks
for which the counters have changed, and then you re-read the counters. I've
added multiple counters, assigned to different purposes, to make it easier to
pin down what has changed - and so reduce the amount of checking required.
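
In userspace, that check is roughly the following (a sketch only, reusing the
fsinfo() wrapper from the sample programs; last_seen_sum[] is a hypothetical
caller-maintained cache of the previously read counter sums, indexed by mount
ID):

	/* Sketch: return 0 if nothing under this mount changed since the
	 * last scan, 1 if it needs re-reading, -1 on error.
	 */
	static int mount_needs_rescan(unsigned int mnt_id,
				      unsigned int *last_seen_sum)
	{
		struct fsinfo_mount_info info;
		struct fsinfo_params params = {
			.flags	 = FSINFO_FLAGS_QUERY_MOUNT,
			.request = FSINFO_ATTR_MOUNT_INFO,
		};
		unsigned int sum;
		char file[32];

		sprintf(file, "%u", mnt_id);
		if (fsinfo(AT_FDCWD, file, &params, sizeof(params),
			   &info, sizeof(info)) == -1)
			return -1;

		sum = info.sb_changes + info.sb_notifications +
		      info.mnt_attr_changes + info.mnt_topology_changes +
		      info.mnt_subtree_notifications;
		if (sum == last_seen_sum[mnt_id])
			return 0;	/* nothing under this mount changed */

		last_seen_sum[mnt_id] = sum;
		return 1;		/* caller re-reads what it cares about */
	}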

What I have added to fsinfo() is a way to atomically retrieve a list of all
the children of a mount, including, for each mount, the mount ID (which may
have been reused), a uniquifier (which shouldn't wrap over the kernel
lifetime) and the sum of the mount object and superblock change counters.

This should allow you to quickly rescan the mount tree as fsinfo() can look up
mounts by mount ID instead of by path or fd.

Below is a sample file from the kernel that scans by this method, displaying
an ascii art tree of all the mounts under a path or mount.

David
---
// SPDX-License-Identifier: GPL-2.0-or-later
/* Test the fsinfo() system call
*
* Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
* Written by David Howells ([email protected])
*/

#define _GNU_SOURCE
#define _ATFILE_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
#include <errno.h>
#include <time.h>
#include <math.h>
#include <sys/syscall.h>
#include <linux/fsinfo.h>
#include <linux/socket.h>
#include <linux/fcntl.h>
#include <sys/stat.h>
#include <arpa/inet.h>

#ifndef __NR_fsinfo
#define __NR_fsinfo -1
#endif

static __attribute__((unused))
ssize_t fsinfo(int dfd, const char *filename,
struct fsinfo_params *params, size_t params_size,
void *result_buffer, size_t result_buf_size)
{
return syscall(__NR_fsinfo, dfd, filename,
params, params_size,
result_buffer, result_buf_size);
}

static char tree_buf[4096];
static char bar_buf[4096];
static unsigned int children_list_interval;

/*
* Get an fsinfo attribute in a statically allocated buffer.
*/
static void get_attr(unsigned int mnt_id, unsigned int attr, unsigned int Nth,
void *buf, size_t buf_size)
{
struct fsinfo_params params = {
.flags = FSINFO_FLAGS_QUERY_MOUNT,
.request = attr,
.Nth = Nth,
};
char file[32];
long ret;

sprintf(file, "%u", mnt_id);

memset(buf, 0xbd, buf_size);

ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), buf, buf_size);
if (ret == -1) {
fprintf(stderr, "mount-%s: %m\n", file);
exit(1);
}
}

/*
* Get an fsinfo attribute in a dynamically allocated buffer.
*/
static void *get_attr_alloc(unsigned int mnt_id, unsigned int attr,
unsigned int Nth, size_t *_size)
{
struct fsinfo_params params = {
.flags = FSINFO_FLAGS_QUERY_MOUNT,
.request = attr,
.Nth = Nth,
};
size_t buf_size = 4096;
char file[32];
void *r;
long ret;

sprintf(file, "%u", mnt_id);

for (;;) {
r = malloc(buf_size);
if (!r) {
perror("malloc");
exit(1);
}
memset(r, 0xbd, buf_size);

ret = fsinfo(AT_FDCWD, file, &params, sizeof(params), r, buf_size);
if (ret == -1) {
fprintf(stderr, "mount-%s: %x,%x,%x %m\n",
file, params.request, params.Nth, params.Mth);
exit(1);
}

if (ret <= buf_size) {
*_size = ret;
break;
}
buf_size = (ret + 4096 - 1) & ~(4096 - 1);
}

return r;
}

/*
* Display a mount and then recurse through its children.
*/
static void display_mount(unsigned int mnt_id, unsigned int depth, char *path)
{
struct fsinfo_mount_topology top;
struct fsinfo_mount_child child;
struct fsinfo_mount_info info;
struct fsinfo_ids ids;
void *children;
unsigned int d;
size_t ch_size, p_size;
char dev[64];
int i, n, s;

get_attr(mnt_id, FSINFO_ATTR_MOUNT_TOPOLOGY, 0, &top, sizeof(top));
get_attr(mnt_id, FSINFO_ATTR_MOUNT_INFO, 0, &info, sizeof(info));
get_attr(mnt_id, FSINFO_ATTR_IDS, 0, &ids, sizeof(ids));
if (depth > 0)
printf("%s", tree_buf);

s = strlen(path);
printf("%s", !s ? "\"\"" : path);
if (!s)
s += 2;
s += depth;
if (s < 38)
s = 38 - s;
else
s = 1;
printf("%*.*s", s, s, "");

sprintf(dev, "%x:%x", ids.f_dev_major, ids.f_dev_minor);
printf("%10u %8x %2x %x %5s %s",
info.mnt_id,
(info.sb_changes +
info.sb_notifications +
info.mnt_attr_changes +
info.mnt_topology_changes +
info.mnt_subtree_notifications),
info.attr, top.propagation,
dev, ids.f_fs_name);
putchar('\n');

children = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN, 0, &ch_size);
n = ch_size / children_list_interval - 1;

bar_buf[depth + 1] = '|';
if (depth > 0) {
tree_buf[depth - 4 + 1] = bar_buf[depth - 4 + 1];
tree_buf[depth - 4 + 2] = ' ';
}

tree_buf[depth + 0] = ' ';
tree_buf[depth + 1] = '\\';
tree_buf[depth + 2] = '_';
tree_buf[depth + 3] = ' ';
tree_buf[depth + 4] = 0;
d = depth + 4;

memset(&child, 0, sizeof(child));
for (i = 0; i < n; i++) {
void *p = children + i * children_list_interval;

if (sizeof(child) >= children_list_interval)
memcpy(&child, p, children_list_interval);
else
memcpy(&child, p, sizeof(child));

if (i == n - 1)
bar_buf[depth + 1] = ' ';
path = get_attr_alloc(child.mnt_id, FSINFO_ATTR_MOUNT_POINT,
0, &p_size);
display_mount(child.mnt_id, d, path + 1);
free(path);
}

free(children);
if (depth > 0) {
tree_buf[depth - 4 + 1] = '\\';
tree_buf[depth - 4 + 2] = '_';
}
tree_buf[depth] = 0;
}

/*
* Find the ID of whatever is at the nominated path.
*/
static unsigned int lookup_mnt_by_path(const char *path)
{
struct fsinfo_mount_info mnt;
struct fsinfo_params params = {
.flags = FSINFO_FLAGS_QUERY_PATH,
.request = FSINFO_ATTR_MOUNT_INFO,
};

if (fsinfo(AT_FDCWD, path, &params, sizeof(params), &mnt, sizeof(mnt)) == -1) {
perror(path);
exit(1);
}

return mnt.mnt_id;
}

/*
* Determine the element size for the mount child list.
*/
static unsigned int query_list_element_size(int mnt_id, unsigned int attr)
{
struct fsinfo_attribute_info attr_info;

get_attr(mnt_id, FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, attr,
&attr_info, sizeof(attr_info));
return attr_info.size;
}

/*
 * Parse the arguments and display the mount tree.
 */
int main(int argc, char **argv)
{
unsigned int mnt_id;
char *path;
bool use_mnt_id = false;
int opt;

while ((opt = getopt(argc, argv, "m"))) {
switch (opt) {
case 'm':
use_mnt_id = true;
continue;
}
break;
}

argc -= optind;
argv += optind;

switch (argc) {
case 0:
mnt_id = lookup_mnt_by_path("/");
path = "ROOT";
break;
case 1:
path = argv[0];
if (use_mnt_id) {
mnt_id = strtoul(argv[0], NULL, 0);
break;
}

mnt_id = lookup_mnt_by_path(argv[0]);
break;
default:
printf("Format: test-mntinfo\n");
printf("Format: test-mntinfo <path>\n");
printf("Format: test-mntinfo -m <mnt_id>\n");
exit(2);
}

children_list_interval =
query_list_element_size(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN);

printf("MOUNT MOUNT ID CHANGE# AT P DEV TYPE\n");
printf("------------------------------------- ---------- -------- -- - ----- --------\n");
display_mount(mnt_id, 0, path);
return 0;
}

2020-03-31 21:55:02

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Karel Zak <[email protected]> wrote:

> - improve fsinfo() to provide set (list) of the attributes by one call

That would be my preferred way. I wouldn't want to let the user pin copies of
state, and I wouldn't want to make open(O_PATH) do it automatically.

David

2020-03-31 21:57:26

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Lennart Poettering <[email protected]> wrote:

> - We also have code that needs to check if /dev/ is plain tmpfs or
> devtmpfs. We cannot use statfs for that, since in both cases
> TMPFS_MAGIC is reported, hence we currently parse
> /proc/self/mountinfo for that to find the fstype string there, which
> is different for both cases.

btw, fsinfo(FSINFO_ATTR_IDS) gets you the name of the filesystem type in
addition to the magic number.
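
For instance, a check along these lines would do (a sketch only, assuming
the fsinfo() syscall wrapper and the struct fsinfo_ids layout used by the
sample program earlier in the thread):

static int dev_is_devtmpfs(void)
{
	struct fsinfo_params params = {
		.flags   = FSINFO_FLAGS_QUERY_PATH,
		.request = FSINFO_ATTR_IDS,
	};
	struct fsinfo_ids ids;

	if (fsinfo(AT_FDCWD, "/dev", &params, sizeof(params),
		   &ids, sizeof(ids)) == -1)
		return -1;

	/* f_fs_name carries the filesystem type name, so plain tmpfs and
	 * devtmpfs can be told apart even though statfs() reports
	 * TMPFS_MAGIC for both. */
	return strcmp(ids.f_fs_name, "devtmpfs") == 0;
}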

David

2020-04-01 08:45:51

by Karel Zak

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 10:54:23PM +0100, David Howells wrote:
> Karel Zak <[email protected]> wrote:
>
> > - improve fsinfo() to provide set (list) of the attributes by one call
>
> That would be my preferred way. I wouldn't want to let the user pin copies of
> state, and I wouldn't want to make open(O_PATH) do it automatically.

You can create a COW object on the first fsinfo() call, ideally adding some
flags to control this behavior -- but you're right, this way is
complicated to implement and possibly dangerous.

I guess returning some vector of attributes in one fsinfo() call will be
good enough.

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2020-04-01 09:06:15

by Karel Zak

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Mar 31, 2020 at 10:52:52PM +0100, David Howells wrote:
> Christian Brauner <[email protected]> wrote:
>
> > querying all properties of a mount atomically all-at-once,
>
> I don't actually offer that, per se.
>
> Having an atomic all-at-once query for a single mount is actually quite a
> burden on the system. There's potentially a lot of state involved, much of
> which you don't necessarily need.

If all means "all possible attributes" then it is unnecessary; for
example ext4 timestamps or volume uuid/label are rarely necessary.
We usually need the following together (as a consistent set):

source
mountpoint
FS type
FS root (FSINFO_ATTR_MOUNT_PATH)
FS options (FSINFO_ATTR_CONFIGURATION)
VFS attributes
VFS propagation flags
mount ID
parent ID
devno (or maj:min)
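
(For reference, fetching roughly that set with the current per-attribute
interface looks something like the sketch below, reusing the get_attr()
and get_attr_alloc() helpers from the sample program earlier in the
thread. Note that each attribute is a separate fsinfo() call, so the set
is not read atomically; source and parent ID are left out.)

static void show_mount_set(unsigned int mnt_id)
{
	struct fsinfo_mount_topology top;
	struct fsinfo_mount_info info;
	struct fsinfo_ids ids;
	char *mountpoint, *options;
	size_t size;

	get_attr(mnt_id, FSINFO_ATTR_MOUNT_TOPOLOGY, 0, &top, sizeof(top));
	get_attr(mnt_id, FSINFO_ATTR_MOUNT_INFO, 0, &info, sizeof(info));
	get_attr(mnt_id, FSINFO_ATTR_IDS, 0, &ids, sizeof(ids));
	mountpoint = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_POINT, 0, &size);
	/* The option string is assumed to come back as NUL-terminated text. */
	options = get_attr_alloc(mnt_id, FSINFO_ATTR_CONFIGURATION, 0, &size);

	printf("%s %s %s %x:%x prop=%x attr=%x id=%u\n",
	       mountpoint, ids.f_fs_name, options,
	       ids.f_dev_major, ids.f_dev_minor,
	       top.propagation, info.attr, info.mnt_id);

	free(options);
	free(mountpoint);
}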

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2020-04-01 13:38:07

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Wed, Apr 1, 2020 at 11:05 AM Karel Zak <[email protected]> wrote:
>
> On Tue, Mar 31, 2020 at 10:52:52PM +0100, David Howells wrote:
> > Christian Brauner <[email protected]> wrote:
> >
> > > querying all properties of a mount atomically all-at-once,
> >
> > I don't actually offer that, per se.
> >
> > Having an atomic all-at-once query for a single mount is actually quite a
> > burden on the system. There's potentially a lot of state involved, much of
> > which you don't necessarily need.
>
> If all means "all possible attributes" then it is unnecessary; for
> example ext4 timestamps or volume uuid/label are rarely necessary.
> We usually need the following together (as a consistent set):
>
> source
> mountpoint
> FS type
> FS root (FSINFO_ATTR_MOUNT_PATH)
> FS options (FSINFO_ATTR_CONFIGURATION)
> VFS attributes
> VFS propagation flags
> mount ID
> parent ID
> devno (or maj:min)

This is trivial with mountfs (reuse format of /proc/PID/mountinfo):

# cat /mnt/30/info
30 20 0:14 / /mnt rw,relatime - mountfs none rw

Attached patch applies against readfile patch.

We might want something more generic, and it won't get any simpler:

mount.h | 1 +
mountfs/super.c | 12 +++++++++++-
proc_namespace.c | 2 +-
3 files changed, 13 insertions(+), 2 deletions(-)

Thanks,
Miklos


Attachments:
mountfs-info.patch (1.73 kB)

2020-04-01 13:58:45

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> Attached patch applies against readfile patch.

But doesn't actually do what Karel asked for. show_mountinfo() itself does
not give you what Karel asked for. Plus there's more information you need to
add to it.

David

2020-04-01 14:42:09

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Di, 31.03.20 22:52, David Howells ([email protected]) wrote:

> Christian Brauner <[email protected]> wrote:
>
> > querying all properties of a mount atomically all-at-once,
>
> I don't actually offer that, per se.
>
> Having an atomic all-at-once query for a single mount is actually quite a
> burden on the system. There's potentially a lot of state involved, much of
> which you don't necessarily need.

Hmm, do it like with statx() and specify a mask for the fields userspace
wants? Then it would be as lightweight as it possibly could be?

Lennart

--
Lennart Poettering, Berlin

2020-04-01 15:09:29

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

David Howells <[email protected]> wrote:

> > Attached patch applies against readfile patch.
>
> But doesn't actually do what Karel asked for. show_mountinfo() itself does
> not give you what Karel asked for. Plus there's more information you need to
> add to it.

And arguably, it's worse than just reading /proc/mounts. If you get a
notification that something changed (ie. you poll /proc/mounts or mount
notifications gives you an overrun) you now have to read *every*
/mountfs/*/info file. That is way more expensive.

David

2020-04-01 15:36:12

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Wed, Apr 1, 2020 at 4:41 PM Lennart Poettering <[email protected]> wrote:
>
> On Di, 31.03.20 22:52, David Howells ([email protected]) wrote:
>
> > Christian Brauner <[email protected]> wrote:
> >
> > > querying all properties of a mount atomically all-at-once,
> >
> > I don't actually offer that, per se.
> >
> > Having an atomic all-at-once query for a single mount is actually quite a
> > burden on the system. There's potentially a lot of state involved, much of
> > which you don't necessarily need.
>
> Hmm, do it like with statx() and specify a mask for the fields userspace
> wants? Then it would be as lightweight as it possibly could be?

Yes, however binary structures mixed with variable length fields are
not going to be pretty.

Again, if we want something even halfway sane for a syscall interface,
go with a string key/value vector.

If that's really needed. I've still not heard a convincing argument
in favor of a syscall.

Thanks,
Miklos

2020-04-01 16:04:07

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> > > But doesn't actually do what Karel asked for. show_mountinfo() itself does
> > > not give you what Karel asked for.
>
> Not sure what you mean. I think it shows precisely the information
> Karel asked for.

It's not atomic.

David

2020-04-01 16:09:17

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> I've still not heard a convincing argument in favor of a syscall.

From your own results, scanning 10000 mounts through mountfs and reading just
two values from each is an order of magnitude slower without the effect of the
dentry/inode caches. It gets faster on the second run because the mountfs
dentries and inodes are cached - but at a cost of >205MiB of RAM. And it's
*still* slower than fsinfo().

David

2020-04-01 16:42:15

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Wed, Apr 1, 2020 at 6:07 PM David Howells <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> wrote:
>
> > I've still not heard a convincing argument in favor of a syscall.
>
> From your own results, scanning 10000 mounts through mountfs and reading just
> two values from each is an order of magnitude slower without the effect of the
> dentry/inode caches. It gets faster on the second run because the mountfs
> dentries and inodes are cached - but at a cost of >205MiB of RAM. And it's
> *still* slower than fsinfo().

Already told you that we can just delete the dentry on dput_final, so
the memory argument is immaterial.

And the speed argument also, because there's no use case where that
would make a difference. You keep bringing up the notification queue
overrun when watching a subtree, but that's going to be painful with
fsinfo(2) as well. If that's a relevant use case (not saying it's
true), might as well add a /mnt/MNT_ID/subtree_info (trivial again)
that contains all information for the subtree. Have fun implementing
that with fsinfo(2).

Thanks,
Miklos

2020-04-01 16:55:50

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Wed, Apr 1, 2020 at 6:02 PM David Howells <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> wrote:
>
> > > > But doesn't actually do what Karel asked for. show_mountinfo() itself does
> > > > not give you what Karel asked for.
> >
> > Not sure what you mean. I think it shows precisely the information
> > Karel asked for.
>
> It's not atomic.

Yes it is.

Thanks,
Miklos

2020-04-01 17:04:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Wed, Apr 1, 2020 at 3:58 PM David Howells <[email protected]> wrote:
>
> David Howells <[email protected]> wrote:
>
> > > Attached patch applies against readfile patch.
> >
> > But doesn't actually do what Karel asked for. show_mountinfo() itself does
> > not give you what Karel asked for.

Not sure what you mean. I think it shows precisely the information
Karel asked for.

> Plus there's more information you need to
> > add to it.

The mountinfo format is extensible (see
Documentation/filesystems/proc.txt) so for example adding the change
counters would be simple.

> And arguably, it's worse than just reading /proc/mounts. If you get a
> notification that something changed (ie. you poll /proc/mounts or mount
> notifications gives you an overrun) you now have to read *every*
> /mountfs/*/info file. That is way more expensive.

fsinfo(2) will never be substantially cheaper than reading and parsing
/mnt/MNT_ID/info. In fact reading a large part of the mount table
using fsinfo(2) will be substantially slower than parsing
/proc/self/mountinfo (this doesn't actually do the parsing but that
would add a very small amount of overhead):

root@kvm:~# ./test-fsinfo-perf /tmp/a 30000
--- make mounts ---
--- test fsinfo by path ---
sum(mnt_id) = 960000
--- test fsinfo by mnt_id ---
sum(mnt_id) = 960000
--- test /proc/fdinfo ---
sum(mnt_id) = 960000
--- test mountfs ---
sum(mnt_id) = 960000
--- test mountinfo ---
sum(mnt_id) = 960000
For 30000 mounts, f= 154963us f2= 148337us p= 1803699us p2= 257019us; m= 53996us; p=11.6*f p=12.2*f2 p=7.0*p2 p=33.4*m
--- umount ---

Yes, that's 33 times faster!

Thanks,
Miklos


Attachments:
test-fsinfo-perf-mountinfo.patch (3.62 kB)

2020-04-02 02:54:18

by Ian Kent

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote:
> On Wed, Apr 1, 2020 at 6:07 PM David Howells <[email protected]>
> wrote:
> > Miklos Szeredi <[email protected]> wrote:
> >
> > > I've still not heard a convincing argument in favor of a syscall.
> >
> > From your own results, scanning 10000 mounts through mountfs and
> > reading just
> > two values from each is an order of magnitude slower without the
> > effect of the
> > dentry/inode caches. It gets faster on the second run because the
> > mountfs
> > dentries and inodes are cached - but at a cost of >205MiB of
> > RAM. And it's
> > *still* slower than fsinfo().
>
> Already told you that we can just delete the dentry on dput_final, so
> the memory argument is immaterial.
>
> And the speed argument also, because there's no use case where that
> would make a difference. You keep bringing up the notification queue
> overrun when watching a subtree, but that's going to be painful with
> fsinfo(2) as well. If that's a relevant use case (not saying it's
> true), might as well add a /mnt/MNT_ID/subtree_info (trivial again)
> that contains all information for the subtree. Have fun implementing
> that with fsinfo(2).

Forgive me for not trawling through your patch to work this out
but how does a poll on a path get what's needed to get mount info.

Or, more specifically, how does one get what's needed to go directly
to the place to get mount info. when something in the tree under the
polled path changes (mount/umount). IIUC poll alone won't do subtree
change monitoring?

Don't get me wrong, neither the proc nor the fsinfo implementations
deal with the notification storms that cause much of the problem we
see now.

IMHO that's a separate and very difficult problem in itself that
can't even be considered until getting the information efficiently
is resolved.

Ian

2020-04-02 13:53:34

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Thu, Apr 2, 2020 at 4:52 AM Ian Kent <[email protected]> wrote:
>
> On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote:
> > On Wed, Apr 1, 2020 at 6:07 PM David Howells <[email protected]>
> > wrote:
> > > Miklos Szeredi <[email protected]> wrote:
> > >
> > > > I've still not heard a convincing argument in favor of a syscall.
> > >
> > > From your own results, scanning 10000 mounts through mountfs and
> > > reading just
> > > two values from each is an order of magnitude slower without the
> > > effect of the
> > > dentry/inode caches. It gets faster on the second run because the
> > > mountfs
> > > dentries and inodes are cached - but at a cost of >205MiB of
> > > RAM. And it's
> > > *still* slower than fsinfo().
> >
> > Already told you that we can just delete the dentry on dput_final, so
> > the memory argument is immaterial.
> >
> > And the speed argument also, because there's no use case where that
> > would make a difference. You keep bringing up the notification queue
> > overrun when watching a subtree, but that's going to be painful with
> > fsinfo(2) as well. If that's a relevant use case (not saying it's
> > true), might as well add a /mnt/MNT_ID/subtree_info (trivial again)
> > that contains all information for the subtree. Have fun implementing
> > that with fsinfo(2).
>
> Forgive me for not trawling through your patch to work this out
> but how does a poll on a path get what's needed to get mount info.
>
> Or, more specifically, how does one get what's needed to go directly
> to the place to get mount info. when something in the tree under the
> polled path changes (mount/umount). IIUC poll alone won't do subtree
> change monitoring?

The mechanisms are basically the same as with fsinfo(2). You can get
to the mountfs entry through the mount ID or through a proc/fd/ type
symlink. So if you have a path, there are two options:

- find out the mount ID belonging to that path and go to /mountfs/$mntid/
- open the path with fd = open(path, O_PATH) and then go to
/proc/self/fdmount/$fd/

Currently the only way to find the mount id from a path is by parsing
/proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to
return it directly from a path. Also the mount notification queue
that David implemented contains the mount ID of the changed mount.
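
For reference, the fdinfo-parsing way looks roughly like this (a sketch
with minimal error handling; it scans for the mnt_id line that the kernel
puts in /proc/self/fdinfo/$fd):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int mnt_id_of(const char *path)
{
	char fdinfo[64], line[256];
	int fd, mnt_id = -1;
	FILE *f;

	fd = open(path, O_PATH | O_CLOEXEC);
	if (fd == -1)
		return -1;

	snprintf(fdinfo, sizeof(fdinfo), "/proc/self/fdinfo/%d", fd);
	f = fopen(fdinfo, "r");
	if (f) {
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "mnt_id: %d", &mnt_id) == 1)
				break;
		fclose(f);
	}
	close(fd);
	return mnt_id;
}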

> Don't get me wrong, neither the proc nor the fsinfo implementations
> deal with the notification storms that cause much of the problem we
> see now.
>
> IMHO that's a separate and very difficult problem in itself that
> can't even be considered until getting the information efficiently
> is resolved.

This mount notification storm issue got me thinking. If I understand
correctly, systemd wants mount notifications so that it can do the
desktop pop-up thing. Is that correct?

But that doesn't apply to automounts at all. A new mount performed by
automount is uninteresting to desktops, since it's triggered by
crossing the automount point (i.e. a normal path lookup), not an
external event like inserting a usb stick, etc...

Am I missing something?

Maybe the solution is to just allow filtering out such notifications
at the source, so automount triggers don't generate events for
systemd.

Thanks,
Miklos

2020-04-02 15:24:59

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

David Howells <[email protected]> wrote:

> ext4_show_mount()

ext4_show_options(), sorry.

David

2020-04-02 15:26:46

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Thu, Apr 2, 2020 at 5:23 PM David Howells <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> wrote:
>
> > > > Not sure what you mean. I think it shows precisely the information
> > > > Karel asked for.
> > >
> > > It's not atomic.
> >
> > Yes it is.
>
> No, it really isn't - though it could be made so.
>
> ext4_show_mount(), for example, doesn't lock against "mount -o remount", so
> the configuration can be changing whilst it's being rendered to text.

Does s_umount nest inside namespace_sem? I really don't see the
relation of those locks.

Thanks,
Miklos

2020-04-02 15:38:09

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Thu, Apr 2, 2020 at 5:28 PM Lennart Poettering <[email protected]> wrote:
>
> On Do, 02.04.20 17:22, Miklos Szeredi ([email protected]) wrote:
>
> > On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <[email protected]> wrote:
> >
> > > You appear to be thinking about the "udisks" project or so?
> >
> > Probably.
> >
> > The real question is: is there a sane way to filter mount
> > notifications so that systemd receives only those which it is
> > interested in, rather than the tens of thousands that for example
> > autofs is managing and has nothing to do with systemd?
>
> systemd cares about all mount points in PID1's mount namespace.
>
> The fact that mount tables can grow large is why we want something
> better than constantly reparsing the whole /proc/self/mountinfo. But
> filtering subsets of that is something we don't really care about.

I can accept that, but you haven't given a reason why that's so.

What does it do with the fact that an automount point was crossed, for
example? How does that affect the operation of systemd?

Thanks,
Miklos

2020-04-02 15:44:45

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> > ext4_show_mount(), for example, doesn't lock against "mount -o remount", so
> > the configuration can be changing whilst it's being rendered to text.
>
> Does s_umount nest inside namespace_sem? I really don't see the
> relation of those locks.

If I understand aright what Al has told me, it's a bad idea to do any blocking
operation inside of namespace_sem apart from kmalloc(GFP_KERNEL).

David

2020-04-02 15:52:22

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Lennart Poettering <[email protected]> wrote:

> systemd cares about all mount points in PID1's mount namespace.
>
> The fact that mount tables can grow large is why we want something
> better than constantly reparsing the whole /proc/self/mountinfo. But
> filtering subsets of that is something we don't really care about.

With the notifications stuff I've done, you can do, for example:

pipe2(pipefd, O_NOTIFICATION_PIPE);
ioctl(pipefd[0], IOC_WATCH_QUEUE_SET_SIZE, 256);
watch_mount(AT_FDCWD, "/", 0, pipefd[0], 0x02);

And that will catch all mount object changes in the subtree rooted at the
given path, in this case "/".

If you want to limit it to just the notifications on that mount, you would
need to install a filter:

struct watch_notification_filter filter = {
.nr_filters = 1,
.filters = {
[0] = {
.type = WATCH_TYPE_MOUNT_NOTIFY,
.subtype_filter[0]= UINT_MAX,
.info_mask = NOTIFY_MOUNT_IS_RECURSIVE,
.info_filter = 0,
},
},
};
ioctl(pipefd[0], IOC_WATCH_QUEUE_SET_FILTER, &filter);

Note that this doesn't monitor for superblock changes and events. They must
be watched individually with something like:

watch_sb(AT_FDCWD, "/afs", AT_NO_AUTOMOUNT, pipefd[0], 0x27);

David

2020-04-02 15:57:20

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

David Howells <[email protected]> wrote:

> .info_mask = NOTIFY_MOUNT_IS_RECURSIVE,

Sorry, I meant NOTIFY_MOUNT_IN_SUBTREE; NOTIFY_MOUNT_IS_RECURSIVE indicates
that the operation was recursive in nature.

David

2020-04-02 16:29:24

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Do, 02.04.20 15:52, Miklos Szeredi ([email protected]) wrote:

> > Don't get me wrong, neither the proc nor the fsinfo implementations
> > deal with the notification storms that cause much of the problem we
> > see now.
> >
> > IMHO that's a separate and very difficult problem in itself that
> > can't even be considered until getting the information efficiently
> > is resolved.
>
> This mount notification storm issue got me thinking. If I understand
> correctly, systemd wants mount notifications so that it can do the
> desktop pop-up thing. Is that correct?

This has little to do with the desktop. Startup scheduling is
mostly about figuring out when we can do the next step of startup, and
to a large extent this means issuing a mount command of some form, then
waiting until it is established, then invoking the next and so on, and
when the right mounts are established, starting the right services that
require them, and so on. And with today's system complexity with
storage daemons and so on this all becomes a complex network of
concurrent dependencies.

Most mounts are established on behalf of pid 1 itself; for those we
could just wait until the mount syscall/command completes (and we
do). But there are plenty of cases where that's not the case, hence we need
to make sure we follow the system mount table state as a whole, regardless
of whether it's systemd itself that triggers some mount or something else (for
example some shell script, udisks, …).

> But that doesn't apply to automounts at all. A new mount performed by
> automount is uninteresting to to desktops, since it's triggered by
> crossing the automount point (i.e. a normal path lookup), not an
> external event like inserting a usb stick, etc...

systemd does not propagate mount events to desktops.

You appear to be thinking about the "udisks" project or so?

Lennart

--
Lennart Poettering, Berlin

2020-04-02 16:39:46

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Miklos Szeredi <[email protected]> wrote:

> > > Not sure what you mean. I think it shows precisely the information
> > > Karel asked for.
> >
> > It's not atomic.
>
> Yes it is.

No, it really isn't - though it could be made so.

ext4_show_mount(), for example, doesn't lock against "mount -o remount", so
the configuration can be changing whilst it's being rendered to text.

David

2020-04-02 16:40:09

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <[email protected]> wrote:

> You appear to be thinking about the "udisks" project or so?

Probably.

The real question is: is there a sane way to filter mount
notifications so that systemd receives only those which it is
interested in, rather than the tens of thousands that for example
autofs is managing and has nothing to do with systemd?

Is there a specific mountpoint or mountpoints that systemd is waiting
for? How exactly does this work?

Thanks,
Miklos

2020-04-02 16:42:18

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Do, 02.04.20 17:22, Miklos Szeredi ([email protected]) wrote:

> On Thu, Apr 2, 2020 at 4:36 PM Lennart Poettering <[email protected]> wrote:
>
> > You appear to be thinking about the "udisks" project or so?
>
> Probably.
>
> The real question is: is there a sane way to filter mount
> notifications so that systemd receives only those which it is
> interested in, rather than the tens of thousands that for example
> autofs is managing and has nothing to do with systemd?

systemd cares about all mount points in PID1's mount namespace.

The fact that mount tables can grow large is why we want something
better than constantly reparsing the whole /proc/self/mountinfo. But
filtering subsets of that is something we don't really care about.

Lennart

--
Lennart Poettering, Berlin

2020-04-02 16:49:15

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Do, 02.04.20 17:35, Miklos Szeredi ([email protected]) wrote:

> > systemd cares about all mount points in PID1's mount namespace.
> >
> > The fact that mount tables can grow large is why we want something
> > better than constantly reparsing the whole /proc/self/mountinfo. But
> > filtering subsets of that is something we don't really care about.
>
> I can accept that, but you haven't given a reason why that's so.
>
> What does it do with the fact that an automount point was crossed, for
> example? How does that affect the operation of systemd?

We don't care how a mount point came to be. If it's autofs or
something else, we don't care. We don't access these mount points
ourselves ever, we just watch their existence.

I mean, it's not just about startup it's also about shutdown. At
shutdown we need to unmount everything from the leaves towards the
root so that all file systems are in a clean state. And that means
*all* mounts, even autofs ones, even udisks ones, or whatever else
established them, we don't care. I mean, the autofs daemon can die any
time, we still must be able to sensibly shutdown, and thus unmount all
mounts inside some autofs hierarchy at the right time, before
unmounting the autofs top-level dir and then what might be further up
the tree.

systemd needs to know the whole tree, to figure out deps properly for
things like that, hence we aren't interested in filtering, we are
interested in minimizing what we do when something changes.

Lennart

--
Lennart Poettering, Berlin

2020-04-02 17:22:03

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <[email protected]> wrote:
>
> On Do, 02.04.20 17:35, Miklos Szeredi ([email protected]) wrote:
>
> > > systemd cares about all mount points in PID1's mount namespace.
> > >
> > > The fact that mount tables can grow large is why we want something
> > > better than constantly reparsing the whole /proc/self/mountinfo. But
> > > filtering subsets of that is something we don't really care about.
> >
> > I can accept that, but you haven't given a reason why that's so.
> >
> > What does it do with the fact that an automount point was crossed, for
> > example? How does that affect the operation of systemd?
>
> We don't care how a mount point came to be. If it's autofs or
> something else, we don't care. We don't access these mount points
> ourselves ever, we just watch their existance.
>
> I mean, it's not just about startup it's also about shutdown. At
> shutdown we need to unmount everything from the leaves towards the
> root so that all file systems are in a clean state.

Unfortunately that's not guaranteed by umounting all filesystems from
the init namespace. A filesystem is shut down when all references to
it are gone. Perhaps you instead want to lazy unmount root (yeah,
that may not actually be allowed, but anyway, lazy unmounting the top
level ones should do) and watch for super block shutdown events
instead.

Does that make any sense?

Thanks,
Miklos

2020-04-03 01:51:59

by Ian Kent

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Thu, 2020-04-02 at 15:52 +0200, Miklos Szeredi wrote:
> On Thu, Apr 2, 2020 at 4:52 AM Ian Kent <[email protected]> wrote:
> > On Wed, 2020-04-01 at 18:40 +0200, Miklos Szeredi wrote:
> > > On Wed, Apr 1, 2020 at 6:07 PM David Howells <[email protected]
> > > >
> > > wrote:
> > > > Miklos Szeredi <[email protected]> wrote:
> > > >
> > > > > I've still not heard a convincing argument in favor of a
> > > > > syscall.
> > > >
> > > > From your own results, scanning 10000 mounts through mountfs
> > > > and
> > > > reading just
> > > > two values from each is an order of magnitude slower without
> > > > the
> > > > effect of the
> > > > dentry/inode caches. It gets faster on the second run because
> > > > the
> > > > mountfs
> > > > dentries and inodes are cached - but at a cost of >205MiB of
> > > > RAM. And it's
> > > > *still* slower than fsinfo().
> > >
> > > Already told you that we can just delete the dentry on
> > > dput_final, so
> > > the memory argument is immaterial.
> > >
> > > And the speed argument also, because there's no use case where
> > > that
> > > would make a difference. You keep bringing up the notification
> > > queue
> > > overrun when watching a subtree, but that's going to be painful
> > > with
> > > fsinfo(2) as well. If that's a relevant use case (not saying
> > > it's
> > > true), might as well add a /mnt/MNT_ID/subtree_info (trivial
> > > again)
> > > that contains all information for the subtree. Have fun
> > > implementing
> > > that with fsinfo(2).
> >
> > Forgive me for not trawling through your patch to work this out
> > but how does a poll on a path get what's needed to get mount info.
> >
> > Or, more specifically, how does one get what's needed to go
> > directly
> > to the place to get mount info. when something in the tree under
> > the
> > polled path changes (mount/umount). IIUC poll alone won't do
> > subtree
> > change monitoring?
>
> The mechanisms are basically the same as with fsinfo(2). You can
> get
> to the mountfs entry through the mount ID or through a proc/fd/ type
> symlink. So if you have a path, there are two options:
>
> - find out the mount ID belonging to that path and go to
> /mountfs/$mntid/
> - open the path with fd = open(path, O_PATH) and then go to
> /proc/self/fdmount/$fd/
>
> Currently the only way to find the mount id from a path is by parsing
> /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to
> return it directly from a path. Also the mount notification queue
> that David implemented contains the mount ID of the changed mount.

I'm aware the mount id comes through David's notifications, I was
wondering how to get that via your recommendation, thanks.

In your scheme it sounds like the mount id doesn't hold the
importance it deserves; it's central to the whole idea of getting
information about these mounts. But it sounds like you need to
open fds to paths you might not know to find it ...

Your explanation also wasn't clear on how one gets notifications for
events within the tree under a mount you've opened an fd on.

>
> > Don't get me wrong, neither the proc nor the fsinfo implementations
> > deal with the notification storms that cause much of the problem we
> > see now.
> >
> > IMHO that's a separate and very difficult problem in itself that
> > can't even be considered until getting the information efficiently
> > is resolved.
>
> This mount notification storm issue got me thinking. If I
> understand
> correctly, systemd wants mount notifications so that it can do the
> desktop pop-up thing. Is that correct?
>
> But that doesn't apply to automounts at all. A new mount performed
> by
> automount is uninteresting to desktops, since it's triggered by
> crossing the automount point (i.e. a normal path lookup), not an
> external event like inserting a usb stick, etc...
>
> Am I missing something?

Yeah, you're not missing anything.

Unfortunately, in a recent discussion on the autofs mailing list,
an investigation showed that systemd does want/get events for
autofs mounts and proceeds to issue around 100 or so events on
the d-bus for every one.

>
> Maybe the solution is to just allow filtering out such notifications
> at the source, so automount triggers don't generate events for
> systemd.

Except that autofs automounts might be expected to be seen on a
desktop, that's not out of the question I guess.

Ian

2020-04-03 09:14:11

by Karel Zak

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Wed, Apr 01, 2020 at 05:25:54PM +0200, Miklos Szeredi wrote:
> fsinfo(2) will never be substantially cheaper than reading and parsing
> /mnt/MNT_ID/info. In fact reading a large part of the mount table
> using fsinfo(2) will be substantially slower than parsing
> /proc/self/mountinfo (this doesn't actually do the parsing but that
> would add a very small amount of overhead):

I think nobody wants to use fsinfo() or mountfs as a replacement for the
whole /proc/self/mountinfo. It does not make sense. We need a
per-mountpoint API; for whole-mount-table use-cases (like findmnt or
lsblk) the current mountinfo is good enough.

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2020-04-03 11:09:33

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Do, 02.04.20 19:20, Miklos Szeredi ([email protected]) wrote:

> On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <[email protected]> wrote:
> >
> > On Do, 02.04.20 17:35, Miklos Szeredi ([email protected]) wrote:
> >
> > > > systemd cares about all mount points in PID1's mount namespace.
> > > >
> > > > The fact that mount tables can grow large is why we want something
> > > > better than constantly reparsing the whole /proc/self/mountinfo. But
> > > > filtering subsets of that is something we don't really care about.
> > >
> > > I can accept that, but you haven't given a reason why that's so.
> > >
> > > What does it do with the fact that an automount point was crossed, for
> > > example? How does that affect the operation of systemd?
> >
> > We don't care how a mount point came to be. If it's autofs or
> > something else, we don't care. We don't access these mount points
> > ourselves ever, we just watch their existence.
> >
> > I mean, it's not just about startup it's also about shutdown. At
> > shutdown we need to unmount everything from the leaves towards the
> > root so that all file systems are in a clean state.
>
> Unfortunately that's not guaranteed by umounting all filesystems from
> the init namespace. A filesystem is shut down when all references to
> it are gone. Perhaps you instead want to lazy unmount root (yeah,
> that may not actually be allowed, but anyway, lazy unmounting the top
> level ones should do) and watch for super block shutdown events
> instead.
>
> Does that make any sense?

When all mounts in the init mount namespace are unmounted and all
remaining processes killed we switch root back to the initrd, so that
even the root fs can be unmounted, and then we disassemble any backing
complex storage if there is, i.e. lvm, luks, raid, …

Because the initrd is its own little root fs independent of the actual
root we can fully disassemble everything this way, as we do not retain
any references to it anymore in any way.

Lennart

--
Lennart Poettering, Berlin

2020-04-03 11:26:57

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fr, 03.04.20 09:44, Ian Kent ([email protected]) wrote:

> > Currently the only way to find the mount id from a path is by parsing
> > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to
> > return it directly from a path. Also the mount notification queue
> > that David implemented contains the mount ID of the changed mount.

I would love to have the mount ID exposed via statx().

In systemd we generally try name_to_handle_at() to query the mount ID
first. It returns both the actual fhandle and the mount ID after all,
and we then throw the fhandle away. It's not available on all fs
though, but it has the benefit that it works without procfs and on
a number of older kernels that didn't expose the mnt id in fdinfo.
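
Roughly like this, for reference (a sketch of that name_to_handle_at()
approach; the handle buffer exists only to be thrown away):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>

static int mnt_id_via_handle(const char *path, int *mount_id)
{
	struct file_handle *h;
	int ret;

	h = malloc(sizeof(*h) + MAX_HANDLE_SZ);
	if (!h)
		return -1;
	h->handle_bytes = MAX_HANDLE_SZ;

	/* Fails with EOPNOTSUPP on filesystems without file handles. */
	ret = name_to_handle_at(AT_FDCWD, path, h, mount_id, 0);
	free(h);		/* the fhandle itself is discarded */
	return ret;
}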

Lennart

--
Lennart Poettering, Berlin

2020-04-03 11:38:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <[email protected]> wrote:
>
> On Fr, 03.04.20 09:44, Ian Kent ([email protected]) wrote:
>
> > > Currently the only way to find the mount id from a path is by parsing
> > > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to
> > > return it directly from a path. Also the mount notification queue
> > > that David implemented contains the mount ID of the changed mount.
>
> I would love to have the mount ID exposed via statx().

Here's a patch.

Thanks,
Miklos


Attachments:
statx-add-mount-id.patch (2.20 kB)

2020-04-03 11:50:40

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fri, Apr 3, 2020 at 1:08 PM Lennart Poettering <[email protected]> wrote:
>
> On Do, 02.04.20 19:20, Miklos Szeredi ([email protected]) wrote:
>
> > On Thu, Apr 2, 2020 at 5:50 PM Lennart Poettering <[email protected]> wrote:
> > >
> > > On Do, 02.04.20 17:35, Miklos Szeredi ([email protected]) wrote:
> > >
> > > > > systemd cares about all mount points in PID1's mount namespace.
> > > > >
> > > > > The fact that mount tables can grow large is why we want something
> > > > > better than constantly reparsing the whole /proc/self/mountinfo. But
> > > > > filtering subsets of that is something we don't really care about.
> > > >
> > > > I can accept that, but you haven't given a reason why that's so.
> > > >
> > > > What does it do with the fact that an automount point was crossed, for
> > > > example? How does that affect the operation of systemd?
> > >
> > > We don't care how a mount point came to be. If it's autofs or
> > > something else, we don't care. We don't access these mount points
> > > ourselves ever, we just watch their existence.
> > >
> > > I mean, it's not just about startup it's also about shutdown. At
> > > shutdown we need to unmount everything from the leaves towards the
> > > root so that all file systems are in a clean state.
> >
> > Unfortunately that's not guaranteed by umounting all filesystems from
> > the init namespace. A filesystem is shut down when all references to
> > it are gone. Perhaps you instead want to lazy unmount root (yeah,
> > that may not actually be allowed, but anyway, lazy unmounting the top
> > level ones should do) and watch for super block shutdown events
> > instead.
> >
> > Does that make any sense?
>
> When all mounts in the init mount namespace are unmounted and all
> remaining processes killed we switch root back to the initrd, so that
> even the root fs can be unmounted, and then we disassemble any backing
> complex storage if there is, i.e. lvm, luks, raid, …

I think it could be done the other way round, much simpler:

- switch back to initrd
- umount root, keeping the tree intact (UMOUNT_DETACHED)
- kill all remaining processes, wait for all to exit

I think that should guarantee that all super blocks have been shut down. Al?

The advantage would be that there's no need to walk the mount tree
unmounting individual leafs, since it's all done automagically.

Thanks,
Miklos

2020-04-03 12:35:58

by Richard Weinberger

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fri, Apr 3, 2020 at 1:40 PM Miklos Szeredi <[email protected]> wrote:
>
> On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <[email protected]> wrote:
> >
> > On Fr, 03.04.20 09:44, Ian Kent ([email protected]) wrote:
> >
> > > > Currently the only way to find the mount id from a path is by parsing
> > > > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to
> > > > return it directly from a path. Also the mount notification queue
> > > > that David implemented contains the mount ID of the changed mount.
> >
> > I would love to have the mount ID exposed via statx().
>
> Here's a patch.

More than once I have looked for a nice way to get the mount id.
Having it exposed via statx() would be great!

--
Thanks,
//richard

2020-04-03 15:13:20

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fr, 03.04.20 13:38, Miklos Szeredi ([email protected]) wrote:

> On Fri, Apr 3, 2020 at 1:11 PM Lennart Poettering <[email protected]> wrote:
> >
> > On Fr, 03.04.20 09:44, Ian Kent ([email protected]) wrote:
> >
> > > > Currently the only way to find the mount id from a path is by parsing
> > > > /proc/self/fdinfo/$fd. It is trivial, however, to extend statx(2) to
> > > > return it directly from a path. Also the mount notification queue
> > > > that David implemented contains the mount ID of the changed mount.
> >
> > I would love to have the mount ID exposed via statx().
>
> Here's a patch.

Oh, this is excellent. I love it, thanks!

BTW, while we are at it: one more thing I'd love to see exposed by
statx() is a simple flag whether the inode is a mount point. There's
plenty code that implements a test like this all over the place, and
it usually isn't very safe. There's one implementation in util-linux
for example (in the /usr/bin/mountpoint binary), and another one in
systemd. Would be awesome to just have a statx() return flag for that,
that would make things *so* much easier and more robust. because in
fact most code isn't very good that implements this, as much of it
just compares st_dev of the specified file and its parent. Better code
compares the mount ID, but as mentioned that's not as pretty as it
could be so far...
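
With the statx mount-ID patch posted in this thread (STATX_MNT_ID and
stx_mnt_id, names assumed from that patch), such a check could look
roughly like this (a sketch only, ignoring symlinks and the "/" corner
case):

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path is the root of a mount, 0 if not, -1 on error. */
static int is_mount_root(const char *path)
{
	struct statx sx, psx;
	char parent[PATH_MAX];

	snprintf(parent, sizeof(parent), "%s/..", path);

	if (statx(AT_FDCWD, path, 0, STATX_MNT_ID, &sx) == -1 ||
	    statx(AT_FDCWD, parent, 0, STATX_MNT_ID, &psx) == -1)
		return -1;

	/* A mount root has a different mount ID from its parent dir. */
	return sx.stx_mnt_id != psx.stx_mnt_id;
}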

Lennart

--
Lennart Poettering, Berlin

2020-04-03 15:37:44

by David Howells

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

Lennart Poettering <[email protected]> wrote:

> BTW, while we are at it: one more thing I'd love to see exposed by
> statx() is a simple flag whether the inode is a mount point.

Note that an inode or a dentry might be a mount point in one namespace, but
not in another. Do you actually mean an inode - or do you actually mean the
(mount,dentry) pair that you're looking at? (Ie. should it be namespace
specific?)

David

2020-04-03 15:45:40

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fr, 03.04.20 13:48, Miklos Szeredi ([email protected]) wrote:

> > > Does that make any sense?
> >
> > When all mounts in the init mount namespace are unmounted and all
> > remaining processes killed we switch root back to the initrd, so that
> > even the root fs can be unmounted, and then we disassemble any backing
> > complex storage if there is, i.e. lvm, luks, raid, …
>
> I think it could be done the other way round, much simpler:
>
> - switch back to initrd
> - umount root, keeping the tree intact (UMOUNT_DETACHED)
> - kill all remaining processes, wait for all to exit

Nah. What I wrote above is drastically simplified. It's IRL more
complex. Specific services need to be killed in between unmounting certain
mounts, since they are a backend for another mount. NFS or
FUSE or stuff like that usually have some processes backing them,
and we need to stop the mounts they provide before these
services, and then the mounts these services reside on after that, and
so on. It's a complex dependency tree of stuff that needs to be done
in order, so that we can deal with arbitrarily nested mounts, storage
subsystems, and backing services.

Anyway, this all works fine in systemd, the dependency logic is
there. We want a more efficient way to watch mounts, that's
all. Subscribing and constantly reparsing /proc/self/mountinfo is
awful, that's all.

Lennart

--
Lennart Poettering, Berlin

2020-04-03 15:49:54

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fr, 03.04.20 16:36, David Howells ([email protected]) wrote:

> Lennart Poettering <[email protected]> wrote:
>
> > BTW, while we are at it: one more thing I'd love to see exposed by
> > statx() is a simple flag whether the inode is a mount point.
>
> Note that an inode or a dentry might be a mount point in one namespace, but
> not in another. Do you actually mean an inode - or do you actually mean the
> (mount,dentry) pair that you're looking at? (Ie. should it be namespace
> specific?)

yes, it should be specific to the mount hierarchy in the current namespace.

Lennart

--
Lennart Poettering, Berlin

2020-04-03 20:31:44

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote:
> BTW, while we are at it: one more thing I'd love to see exposed by
> statx() is a simple flag whether the inode is a mount point. There's
> plenty code that implements a test like this all over the place, and
> it usually isn't very safe. There's one implementation in util-linux
> for example (in the /usr/bin/mountpoint binary), and another one in
> systemd. Would be awesome to just have a statx() return flag for that,
> that would make things *so* much easier and more robust. because in
> fact most code isn't very good that implements this, as much of it
> just compares st_dev of the specified file and its parent. Better code
> compares the mount ID, but as mentioned that's not as pretty as it
> could be so far...

nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
and ".." and returns true if they have different st_dev or the same
st_ino. Comparing mount ids sounds better.

So anyway, yes, everybody reinvents the wheel here, and this would be
useful. (And, yes, we want to know for the vfsmount, we don't care
whether the same inode is used as a mountpoint someplace else.)

--b.

2020-04-06 08:37:14

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fri, Apr 3, 2020 at 10:30 PM J. Bruce Fields <[email protected]> wrote:
>
> On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote:
> > BTW, while we are at it: one more thing I'd love to see exposed by
> > statx() is a simple flag whether the inode is a mount point. There's
> > plenty code that implements a test like this all over the place, and
> > it usually isn't very safe. There's one implementation in util-linux
> > for example (in the /usr/bin/mountpoint binary), and another one in
> > systemd. Would be awesome to just have a statx() return flag for that,
> > that would make things *so* much easier and more robust. because in
> > fact most code isn't very good that implements this, as much of it
> > just compares st_dev of the specified file and its parent. Better code
> > compares the mount ID, but as mentioned that's not as pretty as it
> > could be so far...
>
> nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> and ".." and returns true if they have different st_dev or the same
> st_ino. Comparing mount ids sounds better.
>
> So anyway, yes, everybody reinvents the wheel here, and this would be
> useful. (And, yes, we want to know for the vfsmount, we don't care
> whether the same inode is used as a mountpoint someplace else.)

Attaching a patch.

There's some ambiguity about what is a "mountpoint" and what these
tools are interested in. My guess is that they are not interested in
an object being a mount point (something where another object is
mounted) but being a mount root (this is the object mounted at the
mount point). I.e

fd = open("/mnt", O_PATH);
mount("/bin", "/mnt", NULL, MS_BIND, NULL);
statx(AT_FDCWD, "/mnt", 0, 0, &stx1);
statx(fd, "", AT_EMPTY_PATH, 0, &stx2);
printf("mount_root(/mnt) = %c, mount_root(fd) = %c\n",
stx1.stx_attributes & STATX_ATTR_MOUNT_ROOT ? 'y' : 'n',
stx2.stx_attributes & STATX_ATTR_MOUNT_ROOT ? 'y' : 'n');

Would print:
mount_root(/mnt) = y, mount_root(fd) = n

Thanks,
Miklos


Attachments:
statx-add-mount_root.patch (1.48 kB)

2020-04-06 09:17:58

by Karel Zak

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> On Fri, Apr 03, 2020 at 05:12:23PM +0200, Lennart Poettering wrote:
> > BTW, while we are at it: one more thing I'd love to see exposed by
> > statx() is a simple flag whether the inode is a mount point. There's
> > plenty code that implements a test like this all over the place, and
> > it usually isn't very safe. There's one implementation in util-linux
> > for example (in the /usr/bin/mountpoint binary), and another one in
> > systemd. Would be awesome to just have a statx() return flag for that,
> > that would make things *so* much easier and more robust. because in
> > fact most code isn't very good that implements this, as much of it
> > just compares st_dev of the specified file and its parent. Better code
> > compares the mount ID, but as mentioned that's not as pretty as it
> > could be so far...
>
> nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> and ".." and returns true if they have different st_dev or the same
> st_ino. Comparing mount ids sounds better.

BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
For mountpoint(1) we search the directory in /proc/self/mountinfo.

> So anyway, yes, everybody reinvents the wheel here, and this would be
> useful.

+1

Karel

--
Karel Zak <[email protected]>
http://karelzak.blogspot.com

2020-04-06 09:23:42

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Fri, Apr 3, 2020 at 5:01 PM Lennart Poettering <[email protected]> wrote:
>
> On Fr, 03.04.20 13:48, Miklos Szeredi ([email protected]) wrote:
>
> > > > Does that make any sense?
> > >
> > > When all mounts in the init mount namespace are unmounted and all
> > > remaining processes killed we switch root back to the initrd, so that
> > > even the root fs can be unmounted, and then we disassemble any backing
> > > complex storage if there is, i.e. lvm, luks, raid, …
> >
> > I think it could be done the other way round, much simpler:
> >
> > - switch back to initrd
> > - umount root, keeping the tree intact (UMOUNT_DETACHED)
> > - kill all remaining processes, wait for all to exit
>
> Nah. What I wrote above is drastically simplified. It's IRL more
> complex. Specific services need to be killed between certain mounts
> are unmounted, since they are a backend for another mount. NFS, or
> FUSE or stuff like that usually has some processes backing them
> around, and we need to stop the mounts they provide before these
> services, and then the mounts these services reside on after that, and
> so on. It's a complex dependency tree of stuff that needs to be done
> in order, so that we can deal with arbitrarily nested mounts, storage
> subsystems, and backing services.

That still doesn't explain why you need to keep track of all mounts in
the system.

If you are aware of the dependency, then you need to keep track of
that particular mount. If not, then why?

What I'm starting to see is that there's a fundamental conflict
between how systemd people want to deal with new mounts and how some
other people want to use mounts (i.e. tens of thousands of mounts in
an automount map).

I'm really curious how much the mount notification ring + per mount
query (any implementation) can help that use case.

> Anyway, this all works fine in systemd, the dependency logic is
> there. We want a more efficient way to watch mounts, that's
> all. Subscribing and constantly reparsing /proc/self/mountinfo is
> awful, that's all.

I'm not sure that is all. To handle storms of tens of thousands of
mounts, my guess is that the fundamental way of dealing with these
changes will need to be updated in systemd.

Thanks,
Miklos

2020-04-06 16:08:09

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

The patch makes sense to me, thanks!

In the NFS case it's implementing the "mountpoint" export option:

mountpoint=path

mp This option makes it possible to only export a directory if it
has successfully been mounted. If no path is given (e.g.
mountpoint or mp) then the export point must also be a mount
point. If it isn't then the export point is not exported. This
allows you to be sure that the directory underneath a mountpoint
will never be exported by accident if, for example, the filesystem
failed to mount due to a disc error.

If a path is given (e.g. mountpoint=/path or mp=/path) then the
nominated path must be a mountpoint for the exportpoint to be
exported.

--b.

On Mon, Apr 06, 2020 at 10:35:55AM +0200, Miklos Szeredi wrote:
> From: Miklos Szeredi <[email protected]>
> Subject: statx: add mount_root
>
> Determining whether a path or file descriptor refers to a mountpoint (or
> more precisely a mount root) is not trivial using current tools.
>
> Add a flag to statx that indicates whether the path or fd refers to the
> root of a mount or not.
>
> Reported-by: Lennart Poettering <[email protected]>
> Reported-by: J. Bruce Fields <[email protected]>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/stat.c | 3 +++
> include/uapi/linux/stat.h | 1 +
> 2 files changed, 4 insertions(+)
>
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -172,6 +172,7 @@ struct statx {
> #define STATX_ATTR_NODUMP 0x00000040 /* [I] File is not to be dumped */
> #define STATX_ATTR_ENCRYPTED 0x00000800 /* [I] File requires key to decrypt in fs */
> #define STATX_ATTR_AUTOMOUNT 0x00001000 /* Dir: Automount trigger */
> +#define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */
> #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
>
>
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -202,6 +202,9 @@ int vfs_statx(int dfd, const char __user
> error = vfs_getattr(&path, stat, request_mask, flags);
> stat->mnt_id = real_mount(path.mnt)->mnt_id;
> stat->result_mask |= STATX_MNT_ID;
> + if (path.mnt->mnt_root == path.dentry)
> + stat->attributes |= STATX_ATTR_MOUNT_ROOT;
> + stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
> path_put(&path);
> if (retry_estale(error, lookup_flags)) {
> lookup_flags |= LOOKUP_REVAL;

2020-04-06 16:36:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <[email protected]> wrote:
>
> On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> >
> > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > and ".." and returns true if they have different st_dev or the same
> > st_ino. Comparing mount ids sounds better.
>
> BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> For mountpoint(1) we search the directory in /proc/self/mountinfo.

These days you should probably use openat2() with RESOLVE_NO_XDEV.

No need for any mountinfo or anything like that. Just look up the
pathname and say "don't cross mount-points", and you'll get an error
if it's a mount crossing lookup.
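
A rough sketch of that check (assuming a 5.6+ kernel; there was no glibc
wrapper for openat2() at the time, so this goes through syscall(2), and
SYS_openat2 is assumed to be present in the libc headers):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Is "name", looked up relative to parent_dirfd, a mount point there?
 * Returns 1 for yes, 0 for no, -1 on error. */
static int is_mount_point_at(int parent_dirfd, const char *name)
{
	struct open_how how = {
		.flags   = O_PATH | O_CLOEXEC,
		.resolve = RESOLVE_NO_XDEV,
	};
	int fd = syscall(SYS_openat2, parent_dirfd, name, &how, sizeof(how));

	if (fd >= 0) {
		close(fd);
		return 0;	/* lookup stayed on one mount */
	}
	return errno == EXDEV ? 1 : -1;
}

So "is /mnt/foo a mount point" becomes opening /mnt with O_PATH and asking
about "foo".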

So this kind of thing is _not_ an argument for another kernel querying
interface. We got a new (and better) model for a lot of this.

Linus
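
A minimal sketch of one way to apply that suggestion (assuming a 5.6+ kernel
with openat2(2), headers that provide linux/openat2.h and SYS_openat2, and
no glibc wrapper, so the call goes through syscall(2)): resolve the last
path component relative to its parent with RESOLVE_NO_XDEV and treat EXDEV
as "this entry is a mount point".

/* no_xdev_check.c - sketch: is <name> inside <dir> the root of a mount?
 * RESOLVE_NO_XDEV makes openat2(2) fail with EXDEV if the lookup would
 * cross a mount point.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct open_how how = {
		.flags   = O_PATH | O_CLOEXEC,
		.resolve = RESOLVE_NO_XDEV,	/* refuse to cross mounts */
	};
	int dirfd, fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dir> <name>\n", argv[0]);
		return 2;
	}

	dirfd = open(argv[1], O_PATH | O_DIRECTORY | O_CLOEXEC);
	if (dirfd < 0) {
		perror("open parent");
		return 2;
	}

	fd = syscall(SYS_openat2, dirfd, argv[2], &how, sizeof(how));
	if (fd < 0) {
		if (errno == EXDEV) {
			printf("%s/%s is a mount point\n", argv[1], argv[2]);
			return 0;
		}
		perror("openat2");
		return 2;
	}

	printf("%s/%s is not a mount point\n", argv[1], argv[2]);
	close(fd);
	close(dirfd);
	return 1;
}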

2020-04-06 17:46:34

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mo, 06.04.20 11:22, Miklos Szeredi ([email protected]) wrote:

> > Nah. What I wrote above is drastically simplified. It's IRL more
> > complex. Specific services need to be killed between certain mounts
> > are unmounted, since they are a backend for another mount. NFS, or
> > FUSE or stuff like that usually has some processes backing them
> > around, and we need to stop the mounts they provide before these
> > services, and then the mounts these services reside on after that, and
> > so on. It's a complex dependency tree of stuff that needs to be done
> > in order, so that we can deal with arbitrarily nested mounts, storage
> > subsystems, and backing services.
>
> That still doesn't explain why you need to keep track of all mounts in
> the system.
>
> If you are aware of the dependency, then you need to keep track of
> that particular mount. If not, then why?

It works the other way round in systemd: something happens, i.e. a
device pops up or a mount is established, and systemd figures out if
there's something to do, i.e. whether services shall be pulled in or
so.

It's that way for a reason: there are plenty of services that want to
be instantiated once for each object of a certain kind that pops up
(this happens very often for devices, but could also happen for any
other kind of "unit" systemd manages, and one of those kinds is mount
units). For those we don't know the unit to pull in yet (because it's
not going to be a well-named singleton, but an instance incorporating
some identifier from the source unit) when the unit that pops up does
so, thus we can only wait for the latter to determine what to pull
in.

> What I'm starting to see is that there's a fundamental conflict
> between how systemd people want to deal with new mounts and how some
> other people want to use mounts (i.e. tens of thousands of mounts in
> an automount map).

Well, I am not sure what automount has to do with anything. You can
have 10K mounts with or without automount, it's orthogonal to that. In
fact, I assumed the point of automount was to pretend there are 10K
mounts but not actually have them most of the time, no?

I mean, whether there's room to optimize D-Bus IPC or not is entirely
orthogonal to anything discussed here regarding fsinfo(). Don't make
this about systemd sending messages over D-Bus, that's a very
different story, and a non-issue if you ask me:

Right now, when you have n mounts, and any mount changes, or one is
added or removed then we have to parse the whole mount table again,
asynchronously, processing all n entries again, every frickin
time. This means the work to process n mounts popping up at boot is
O(n²). That sucks, it should be obvious to anyone. Now if we get that
fixed, by some mount API that can send us minimal notifications about
what happened and where, then this becomes O(n), which is totally OK.

You keep talking about filtering, which will just lower the "n" a bit
in particular cases to some value "m" maybe (with m < n); it does not
address the fact that O(m²) is still a big problem.

Hence, filtering is great, no problem, add it if you want it. I
personally don't care about filtering though, and I doubt we'd use it
in systemd; I just care about the O(n²) issue.

If you ask me if D-Bus can handle 10K messages sent over the bus
during boot, then yes, it totally can handle that. Can systemd nicely
process O(n²) mounts internally though equally well? No, obviously not,
if n grows too large. Any computer scientist should understand that.

Anyway, I have the suspicion this discussion has stopped being
useful. I think you are trying to fix problems that userspace actually
doesn't have. I can just tell you what we understand the problems are,
but if you are out trying to fix other perceived ones, then great, but
I mostly lost interest.

Lennart

--
Lennart Poettering, Berlin

2020-04-06 18:48:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mon, Apr 06, 2020 at 09:34:08AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <[email protected]> wrote:
> >
> > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> > >
> > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > > and ".." and returns true if they have different st_dev or the same
> > > st_ino. Comparing mount ids sounds better.
> >
> > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> > For mountpoint(1) we search the directory in /proc/self/mountinfo.
>
> These days you should probably use openat2() with RESOLVE_NO_XDEV.
>
> No need for any mountinfo or anything like that. Just look up the
> pathname and say "don't cross mount-points", and you'll get an error
> if it's a mount crossing lookup.

OK, I can't see why that wouldn't work, thanks.

--b.

>
> So this kind of thing is _not_ an argument for another kernel querying
> interface. We got a new (and better) model for a lot of this.
>
> Linus

2020-04-06 18:49:08

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mo, 06.04.20 09:34, Linus Torvalds ([email protected]) wrote:

> On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <[email protected]> wrote:
> >
> > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> > >
> > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > > and ".." and returns true if they have different st_dev or the same
> > > st_ino. Comparing mount ids sounds better.
> >
> > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> > For mountpoint(1) we search the directory in /proc/self/mountinfo.
>
> These days you should probably use openat2() with RESOLVE_NO_XDEV.

Note that opening a file is relatively "heavy", i.e. it typically triggers
autofs and stuff, and results in security checks (which can fail and
such, and show up in audit).

statx() doesn't do that, and that's explicitly documented
(i.e. AT_NO_AUTOMOUNT and stuff).

Hence, unless openat2() has some mechanism for doing something like an
"open() but not really" (O_PATH isn't really sufficient for this, no?),
I don't think it could be a good replacement for a statx()-type check
of whether something is a mount point or not.

I mean, think about use cases: a common use case for "is this a
mountpoint" checks is tools that traverse directory trees and want to
stop at submounts. They generally try to minimize operations and hence
stat stuff but don't open anything unless it's what they are looking for
(or a subdir they identified as a non-submount). Doing an extra
openat2() in between there doesn't sound so attractive, since you pay
heavily...

Lennart

--
Lennart Poettering, Berlin
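
To make that kind of "stat but don't open" check concrete, here is a
minimal sketch (not from the thread): it decides whether a directory entry
lives on the same mount as its parent using only statx(2) with
AT_NO_AUTOMOUNT, preferring the mount ID when the kernel reports it.
STATX_MNT_ID and the stx_mnt_id field are assumed to be available, as in
the tree the quoted patch is against (released kernels at the time of this
thread did not provide them); otherwise it falls back to the traditional,
bind-mount-unsafe device comparison.

/* same_mount.c - sketch: should a tree walker descend into "child"?
 * Uses only statx(2); nothing is opened and automounts are not triggered.
 * Assumes headers that define STATX_MNT_ID and struct statx::stx_mnt_id.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

static bool same_mount(const char *parent, const char *child)
{
	unsigned int flags = AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW;
	struct statx pst, cst;

	if (statx(AT_FDCWD, parent, flags, STATX_MNT_ID, &pst) == -1 ||
	    statx(AT_FDCWD, child, flags, STATX_MNT_ID, &cst) == -1)
		return false;

	/* Preferred check: compare mount IDs if the kernel reports them. */
	if ((pst.stx_mask & STATX_MNT_ID) && (cst.stx_mask & STATX_MNT_ID))
		return pst.stx_mnt_id == cst.stx_mnt_id;

	/* Fallback: the traditional device comparison, which is not
	 * reliable for bind mounts (as Karel pointed out earlier). */
	return pst.stx_dev_major == cst.stx_dev_major &&
	       pst.stx_dev_minor == cst.stx_dev_minor;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <parent> <child>\n", argv[0]);
		return 2;
	}
	printf("%s and %s are on %s mount\n", argv[1], argv[2],
	       same_mount(argv[1], argv[2]) ? "the same" : "a different");
	return 0;
}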

2020-04-07 02:23:02

by Ian Kent

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mon, 2020-04-06 at 19:29 +0200, Lennart Poettering wrote:
> On Mo, 06.04.20 11:22, Miklos Szeredi ([email protected]) wrote:
>
> > > Nah. What I wrote above is drastically simplified. It's IRL more
> > > complex. Specific services need to be killed between certain
> > > mounts
> > > are unmounted, since they are a backend for another mount. NFS,
> > > or
> > > FUSE or stuff like that usually has some processes backing them
> > > around, and we need to stop the mounts they provide before these
> > > services, and then the mounts these services reside on after
> > > that, and
> > > so on. It's a complex dependency tree of stuff that needs to be
> > > done
> > > in order, so that we can deal with arbitrarily nested mounts,
> > > storage
> > > subsystems, and backing services.
> >
> > That still doesn't explain why you need to keep track of all mounts
> > in
> > the system.
> >
> > If you are aware of the dependency, then you need to keep track of
> > that particular mount. If not, then why?
>
> It works the other way round in systemd: something happens, i.e. a
> device pops up or a mount is established, and systemd figures out if
> there's something to do, i.e. whether services shall be pulled in or
> so.
>
> It's that way for a reason: there are plenty of services that want to
> be instantiated once for each object of a certain kind that pops up
> (this happens very often for devices, but could also happen for any
> other kind of "unit" systemd manages, and one of those kinds is mount
> units). For those we don't know the unit to pull in yet (because it's
> not going to be a well-named singleton, but an instance incorporating
> some identifier from the source unit) when the unit that pops up does
> so, thus we can only wait for the latter to determine what to
> pull
> in.
>
> > What I'm starting to see is that there's a fundamental conflict
> > between how systemd people want to deal with new mounts and how
> > some
> > other people want to use mounts (i.e. tens of thousands of mounts
> > in
> > an automount map).
>
> Well, I am not sure what automount has to do with anything. You can
> have 10K mounts with or without automount, it's orthogonal to that.
> In
> fact, I assumed the point of automount was to pretend there are 10K
> mounts but not actually have them most of the time, no?

Yes, but automount, when using a large direct mount map, will be the
source of lots of mounts, which are of an autofs file system.

>
> I mean, whether there's room to optimize D-Bus IPC or not is entirely
> orthogonal to anything discussed here regarding fsinfo(). Don't make
> this about systemd sending messages over D-Bus, that's a very
> different story, and a non-issue if you ask me:

Quite probably, yes; that's something you can care about if it really
is an issue, but it isn't something I care about myself either.

>
> Right now, when you have n mounts, and any mount changes, or one is
> added or removed then we have to parse the whole mount table again,
> asynchronously, processing all n entries again, every frickin
> time. This means the work to process n mounts popping up at boot is
> O(n²). That sucks, it should be obvious to anyone. Now if we get that
> fixed, by some mount API that can send us minimal notifications about
> what happened and where, then this becomes O(n), which is totally OK.

But this is clearly a problem, and it is what I do care about; the
infrastructure being proposed here can be used to address it.

Unfortunately (and I was mistaken about what systemd does), I don't
see a simple way of improving this. This is because it appears that
systemd, having had to scan the entire mount table every time, has
necessarily ended up with code that can't easily accommodate the
ability to directly get the info for a single mount.

So to improve this I think quite a few changes will be needed in
systemd and libmount. I'm not quite sure how to get that started.
After all, it needs to be done the way Karel would like to see it done
in libmount and the way the systemd folks would like to see it done in
systemd, which is very probably not how I would approach it myself.

>
> You keep talking about filtering, which will just lower the "n" a bit
> in particular cases to some value "m" maybe (with m < n), it does not
> address the fact that O(m²) is still a big problem.
>
> hence, filtering is great, no problem, add it if you want it. I
> personally don't care about filtering though, and I doubt we'd use it
> in systemd, I just care about the O(n²) issue.
>
> If you ask me if D-Bus can handle 10K messages sent over the bus
> during boot, then yes, it totally can handle that. Can systemd nicely
> process O(n²) mounts internally though equally well? No, obviously
> not,
> if n grows too large. Any computer scientist should understand
> that.
>
> Anyway, I have the suspicion this discussion has stopped being
> useful. I think you are trying to fix problems that userspace actually
> doesn't have. I can just tell you what we understand the problems
> are,
> but if you are out trying to fix other perceived ones, then great,
> but
> I mostly lost interest.

Yes, filtering sounds like we've wandered off topic ;)

Ian

2020-04-07 14:01:48

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <[email protected]> wrote:
> > Right now, when you have n mounts, and any mount changes, or one is
> > added or removed then we have to parse the whole mount table again,
> > asynchronously, processing all n entries again, every frickin
> > time. This means the work to process n mounts popping up at boot is
> > O(n²). That sucks, it should be obvious to anyone. Now if we get that
> > fixed, by some mount API that can send us minimal notifications about
> > what happened and where, then this becomes O(n), which is totally OK.

Something's not right with the above statement. Hint: if there are
lots of events in quick succession, you can batch them quite easily to
prevent overloading the system.

Wrote a pair of utilities to check out the capabilities of the current
API. The first one just creates N mounts, optionally sleeping
between each. The second one watches /proc/self/mountinfo and
generates individual (add/del/change) events based on POLLPRI and
comparing contents with previous instance.

First use case: create 10,000 mounts, then start the watcher and
create 1000 mounts with a 50ms sleep between them. Total time (user +
system) consumed by the watcher: 25s. This is indeed pretty dismal,
and a per-mount query will help tremendously. But it's still "just"
25ms per mount, so if the mounts are far apart (which is what this
test is about), this won't thrash the system. Note how this is
self-regulating: if the load is high, it will automatically batch more
requests, preventing overload. It is also prone to losing add + remove
pairs in these cases (and so is the ring-buffer-based one from David).

Second use case: start the watcher and create 50,000 mounts with no
sleep between them. Total time consumed by the watcher: 0.154s or
3.08us/event. Note, the same test case adds about 5ms for the
50,000 umount events, which is 0.1us/event.

Real life will probably be between these extremes, but it's clear that
there's room for improvement in userspace as well as kernel
interfaces. The current kernel interface is very efficient in
retrieving a lot of state in one go. It is not efficient in handling
small differences.

> > Anyway, I have the suspicion this discussion has stopped being
> > useful. I think you are trying to fix problems that userspace actually
> > doesn't have. I can just tell you what we understand the problems
> > are,
> > but if you are out trying to fix other perceived ones, then great,
> > but
> > I mostly lost interest.

I was, and still am, trying to see the big picture.

Whatever. I think it's your turn to show some numbers about how the
new API improves performance of systemd with a large number of mounts.

Thanks,
Miklos


Attachments:
many-mounts.c (1.13 kB)
watch_mounts.c (3.30 kB)
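
For reference, a minimal sketch of the mechanism such a watcher relies on
(this is not the attached watch_mounts.c, and it omits the diffing of
snapshots that turns re-reads into add/del/change events):
/proc/self/mountinfo can be polled, and the kernel flags it with
POLLPRI/POLLERR whenever the mount table changes, at which point the whole
file has to be re-read.

/* mountinfo_poll.c - sketch: wake up whenever the mount table changes
 * and re-read /proc/self/mountinfo.  A real watcher would parse each
 * snapshot and diff it against the previous one.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static ssize_t slurp(int fd, char *buf, size_t size)
{
	ssize_t n, total = 0;

	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	while ((n = read(fd, buf + total, size - total)) > 0)
		total += n;
	return n < 0 ? -1 : total;
}

int main(void)
{
	static char buf[1 << 20];	/* assumed big enough for the table */
	int fd = open("/proc/self/mountinfo", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Initial snapshot; later snapshots would be diffed against it. */
	if (slurp(fd, buf, sizeof(buf)) < 0) {
		perror("read");
		return 1;
	}

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLPRI };

		if (poll(&pfd, 1, -1) < 0) {
			perror("poll");
			return 1;
		}
		/* The mount table changed: take a new snapshot. */
		if (slurp(fd, buf, sizeof(buf)) < 0) {
			perror("read");
			return 1;
		}
		fprintf(stderr, "mount table changed\n");
	}
}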

2020-04-07 15:55:37

by Lennart Poettering

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Di, 07.04.20 15:59, Miklos Szeredi ([email protected]) wrote:

> On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <[email protected]> wrote:
> > > Right now, when you have n mounts, and any mount changes, or one is
> > > added or removed then we have to parse the whole mount table again,
> > > asynchronously, processing all n entries again, every frickin
> > > time. This means the work to process n mounts popping up at boot is
> > > O(n²). That sucks, it should be obvious to anyone. Now if we get that
> > > fixed, by some mount API that can send us minimal notifications about
> > > what happened and where, then this becomes O(n), which is totally OK.
>
> Something's not right with the above statement. Hint: if there are
> lots of events in quick succession, you can batch them quite easily to
> prevent overloading the system.
>
> Wrote a pair of utilities to check out the capabilities of the current
> API. The first one just creates N mounts, optionally sleeping
> between each. The second one watches /proc/self/mountinfo and
> generates individual (add/del/change) events based on POLLPRI and
> comparing contents with previous instance.
>
> First use case: create 10,000 mounts, then start the watcher and
> create 1000 mounts with a 50ms sleep between them. Total time (user +
> system) consumed by the watcher: 25s. This is indeed pretty dismal,
> and a per-mount query will help tremendously. But it's still "just"
> 25ms per mount, so if the mounts are far apart (which is what this
> test is about), this won't thrash the system. Note how this is
> self-regulating: if the load is high, it will automatically batch more
> requests, preventing overload. It is also prone to losing add + remove
> pairs in these cases (and so is the ring-buffer-based one from David).

We will batch requests too in systemd, of course, necessarily, given
that the /p/s/mi inotify stuff is async. The thing, though, is that this
means we buy lower CPU usage (working around the O(n²) issue) by
introducing artificially higher latencies. We usually want to boot
quickly, not artificially slowly.

Sure, one can come up with some super smart scheme for how to tweak the
artificial latencies, how to grow them, how to shrink them, depending
on a perceived flood of events, some backing-off scheme. But that's
just polishing a turd, if all we want is proper queued change
notification without the O(n²) behaviour.

I mean, the fix for an O(n²) algorithm is to make it O(n) or so. By
coalescing wake-up events you just lower the n again, probably
linearly, but that still means we pay O(n²), which sucks.

Lennart

--
Lennart Poettering, Berlin

2020-04-07 16:08:47

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Tue, Apr 7, 2020 at 5:53 PM Lennart Poettering <[email protected]> wrote:
>
> On Di, 07.04.20 15:59, Miklos Szeredi ([email protected]) wrote:
>
> > On Tue, Apr 7, 2020 at 4:22 AM Ian Kent <[email protected]> wrote:
> > > > Right now, when you have n mounts, and any mount changes, or one is
> > > > added or removed then we have to parse the whole mount table again,
> > > > asynchronously, processing all n entries again, every frickin
> > > > time. This means the work to process n mounts popping up at boot is
> > > > O(n²). That sucks, it should be obvious to anyone. Now if we get that
> > > > fixed, by some mount API that can send us minimal notifications about
> > > > what happened and where, then this becomes O(n), which is totally OK.
> >
> > Something's not right with the above statement. Hint: if there are
> > lots of events in quick succession, you can batch them quite easily to
> > prevent overloading the system.
> >
> > Wrote a pair of utilities to check out the capabilities of the current
> > API. The first one just creates N mounts, optionally sleeping
> > between each. The second one watches /proc/self/mountinfo and
> > generates individual (add/del/change) events based on POLLPRI and
> > comparing contents with previous instance.
> >
> > First use case: create 10,000 mounts, then start the watcher and
> > create 1000 mounts with a 50ms sleep between them. Total time (user +
> > system) consumed by the watcher: 25s. This is indeed pretty dismal,
> > and a per-mount query will help tremendously. But it's still "just"
> > 25ms per mount, so if the mounts are far apart (which is what this
> > test is about), this won't thrash the system. Note how this is
> > self-regulating: if the load is high, it will automatically batch more
> > requests, preventing overload. It is also prone to losing add + remove
> > pairs in these cases (and so is the ring-buffer-based one from David).
>
> We will batch requests too in systemd, of course, necessarily, given
> that the /p/s/mi inotify stuff is async. The thing, though, is that this
> means we buy lower CPU usage (working around the O(n²) issue) by
> introducing artificially higher latencies. We usually want to boot
> quickly, not artificially slowly.
>
> Sure, one can come up with some super smart scheme for how to tweak the
> artificial latencies, how to grow them, how to shrink them, depending
> on a perceived flood of events, some backing-off scheme. But that's
> just polishing a turd, if all we want is proper queued change
> notification without the O(n²) behaviour.
>
> I mean, the fix for an O(n²) algorithm is to make it O(n) or so. By
> coalescing wake-up events you just lower the n again, probably
> linearly, but that still means we pay O(n²), which sucks.

Obviously. But you keep ignoring event queue overflows; an overflow is
basically guaranteed to happen with a sizable mount storm, and then you
are back to O(n²).

Give it some testing please, as Linus is not going to take any
solution without an actual use case and testing. When you come back
and say that fsinfo(2) works fine with systemd and a 100k mount/umount
storm, then we can have a look at the performance budget and revisit
the fine points of API design.

Thanks,
Miklos

2020-04-08 03:39:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Upcoming: Notifications, FS notifications and fsinfo()

On Mon, Apr 6, 2020 at 11:48 AM Lennart Poettering <[email protected]> wrote:
>
> On Mo, 06.04.20 09:34, Linus Torvalds ([email protected]) wrote:
>
> > On Mon, Apr 6, 2020 at 2:17 AM Karel Zak <[email protected]> wrote:
> > >
> > > On Fri, Apr 03, 2020 at 04:30:24PM -0400, J. Bruce Fields wrote:
> > > >
> > > > nfs-utils/support/misc/mountpoint.c:check_is_mountpoint() stats the file
> > > > and ".." and returns true if they have different st_dev or the same
> > > > st_ino. Comparing mount ids sounds better.
> > >
> > > BTW, this traditional st_dev+st_ino way is not reliable for bind mounts.
> > > For mountpoint(1) we search the directory in /proc/self/mountinfo.
> >
> > These days you should probably use openat2() with RESOLVE_NO_XDEV.
>
> Note that opening a file is relatively "heavy" i.e. typically triggers
> autofs and stuff, and results in security checks (which can fail and
> such, and show up in audit).

For the use that Bruce outlined, openat2() with RESOLVE_NO_XDEV is
absolutely the right thing.

He already did the stat() of the file (and ".."); RESOLVE_NO_XDEV is
only an improvement. It's also a lot better than trying to parse
mountinfo.

Now, I don't disagree that a statx() flag to also indicate "that's a
top-level mount" might be a good idea, and may be the right answer for
other cases.

I'm just saying that, considering what Bruce does now, RESOLVE_NO_XDEV
sounds like the no-brainer approach, and needs no new support outside
of what we already had for other reasons.

(And O_PATH _may_ or may not be part of what you want to do; it's an
independent, separate issue, but automount behavior wrt an O_PATH lookup
is somewhat unclear - see Al's other emails on that subject.)

Linus