2023-04-13 22:01:55

by Jeff Layton

[permalink] [raw]
Subject: allowing for a completely cached umount(2) pathwalk

David Wysochanski posted this a week ago:

https://lore.kernel.org/linux-nfs/CALF+zOnizN1KSE=V095LV6Mug8dJirqk7eN1joX8L1-EroohPA@mail.gmail.com/

It describes a situation where there are nested NFS mounts on a client,
and one of the intermediate mounts ends up being unexported from the
server. In a situation like this, we end up being unable to pathwalk
down to the child mount of these unreachable dentries and can't unmount
anything, even as root.

A decade ago, we did some work to make the kernel not revalidate the
leaf dentry on a umount [1]. This helped some similar sorts of problems
but it doesn't help if the problem is an intermediate dentry.

The idea at the time was that umount(2) is a special case: We are
specifically looking to stop using the mount, so there's nothing to be
gained by revalidating its root dentry and inode.

Based on the problem Dave describes, I'd submit that umount(2) is
special in another way too: It's intended to manipulate the mount table
of the local host, so contacting the backing store (the NFS server in
this case) during a pathwalk doesn't really help anything. All we care
about is getting to the right spot in the mount tree.

A "modest" proposal:
--------------------
This is still somewhat handwavy, but what if we were to make umount(2)
an even more special case for the pathwalk? For the umount(2) pathwalk,
we could:

1/ walk down the dentry tree without calling ->d_revalidate: We don't
care about changes that might have happened remotely. All we care about
is walking down the cached dentries as they are at that moment.

2/ disallow ->lookup operations: a umount is about removing an existing
mount, so the dentries had better already be there.

3/ skip inode ->permission checks. We don't want to check with the
server about our permission to walk the path when we're looking to
unmount. We're walking down the path on the _local_ machine so we can
unuse it. The server should have no say-so in the matter. (We probably
would want to require CAP_SYS_ADMIN or CAP_DAC_READ_SEARCH for this of
course).

We might need other safety checks too that I haven't considered yet.

Is this a terrible idea? Are there potentially problems with
containerized setups if we were to do something like this? Are there
better ways to solve this problem (and others like it)? Maybe this would
be best done with a new UMOUNT_CACHED flag for umount2()?

--
Jeff Layton <[email protected]>

[1] 8033426e6bdb vfs: allow umount to handle mountpoints without
revalidating them


2023-04-13 22:25:51

by Andreas Dilger

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Apr 13, 2023, at 4:00 PM, Jeff Layton <[email protected]> wrote:
>
> David Wysochanski posted this a week ago:
>
> https://lore.kernel.org/linux-nfs/CALF+zOnizN1KSE=V095LV6Mug8dJirqk7eN1joX8L1-EroohPA@mail.gmail.com/
>
> It describes a situation where there are nested NFS mounts on a client,
> and one of the intermediate mounts ends up being unexported from the
> server. In a situation like this, we end up being unable to pathwalk
> down to the child mount of these unreachable dentries and can't unmount
> anything, even as root.
>
> A decade ago, we did some work to make the kernel not revalidate the
> leaf dentry on a umount [1]. This helped some similar sorts of problems
> but it doesn't help if the problem is an intermediate dentry.
>
> The idea at the time was that umount(2) is a special case: We are
> specifically looking to stop using the mount, so there's nothing to be
> gained by revalidating its root dentry and inode.
>
> Based on the problem Dave describes, I'd submit that umount(2) is
> special in another way too: It's intended to manipulate the mount table
> of the local host, so contacting the backing store (the NFS server in
> this case) during a pathwalk doesn't really help anything. All we care
> about is getting to the right spot in the mount tree.
>
> A "modest" proposal:
> --------------------
> This is still somewhat handwavy, but what if we were to make umount(2)
> an even more special case for the pathwalk? For the umount(2) pathwalk,
> we could:
>
> 1/ walk down the dentry tree without calling ->d_revalidate: We don't
> care about changes that might have happened remotely. All we care about
> is walking down the cached dentries as they are at that moment.
>
> 2/ disallow ->lookup operations: a umount is about removing an existing
> mount, so the dentries had better already be there.
>
> 3/ skip inode ->permission checks. We don't want to check with the
> server about our permission to walk the path when we're looking to
> unmount. We're walking down the path on the _local_ machine so we can
> unuse it. The server should have no say-so in the matter. (We probably
> would want to require CAP_SYS_ADMIN or CAP_DAC_READ_SEARCH for this of
> course).
>
> We might need other safety checks too that I haven't considered yet.
>
> Is this a terrible idea? Are there potentially problems with
> containerized setups if we were to do something like this? Are there
> better ways to solve this problem (and others like it)? Maybe this would
> be best done with a new UMOUNT_CACHED flag for umount2()?
> --
> [1] 8033426e6bdb vfs: allow umount to handle mountpoints without
> revalidating them

This would be great. This has been a problem when unmounting the
filesystem ever since "umount" was changed to stat() the mountpoint
by path before unmount, added in util-linux 2.19 (FC15, el7, SLES12).

Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2023-04-13 22:43:27

by NeilBrown

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, 14 Apr 2023, Jeff Layton wrote:
> David Wysochanski posted this a week ago:
>
> https://lore.kernel.org/linux-nfs/CALF+zOnizN1KSE=V095LV6Mug8dJirqk7eN1joX8L1-EroohPA@mail.gmail.com/
>
> It describes a situation where there are nested NFS mounts on a client,
> and one of the intermediate mounts ends up being unexported from the
> server. In a situation like this, we end up being unable to pathwalk
> down to the child mount of these unreachable dentries and can't unmount
> anything, even as root.
>
> A decade ago, we did some work to make the kernel not revalidate the
> leaf dentry on a umount [1]. This helped some similar sorts of problems
> but it doesn't help if the problem is an intermediate dentry.
>
> The idea at the time was that umount(2) is a special case: We are
> specifically looking to stop using the mount, so there's nothing to be
> gained by revalidating its root dentry and inode.
>
> Based on the problem Dave describes, I'd submit that umount(2) is
> special in another way too: It's intended to manipulate the mount table
> of the local host, so contacting the backing store (the NFS server in
> this case) during a pathwalk doesn't really help anything. All we care
> about is getting to the right spot in the mount tree.
>
> A "modest" proposal:

I hope you didn't mean to reference
https://en.wikipedia.org/wiki/A_Modest_Proposal ...

> --------------------
> This is still somewhat handwavy, but what if we were to make umount(2)
> an even more special case for the pathwalk? For the umount(2) pathwalk,
> we could:
>
> 1/ walk down the dentry tree without calling ->d_revalidate: We don't
> care about changes that might have happened remotely. All we care about
> is walking down the cached dentries as they are at that moment.
>
> 2/ disallow ->lookup operations: a umount is about removing an existing
> mount, so the dentries had better already be there.
>
> 3/ skip inode ->permission checks. We don't want to check with the
> server about our permission to walk the path when we're looking to
> unmount. We're walking down the path on the _local_ machine so we can
> unuse it. The server should have no say-so in the matter. (We probably
> would want to require CAP_SYS_ADMIN or CAP_DAC_READ_SEARCH for this of
> course).
>
> We might need other safety checks too that I haven't considered yet.
>
> Is this a terrible idea? Are there potentially problems with
> containerized setups if we were to do something like this? Are there
> better ways to solve this problem (and others like it)? Maybe this would
> be best done with a new UMOUNT_CACHED flag for umount2()?

It might be a terrible idea, but it is essentially the same idea that I
had, but hadn't got around to posting.

The path name that appears in /proc/mounts is the key that must be used
to find and unmount a filesystem. When you do that "find"ing you are
not looking up a name in a filesystem, you are looking up a key in the
mount table.

We could, instead, create an api that is given a mount-id (first number
in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
read /proc/self/mountinfo, find the mount-id, and unmount it - all
without ever doing path name lookup in the traditional sense.

But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
LOOKUP_CACHED, and it only finds paths that are in the dcache, never
revalidates, at most performs simple permission checks based on cached
content.

NeilBrown

2023-04-14 02:38:10

by Al Viro

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote:

> It describes a situation where there are nested NFS mounts on a client,
> and one of the intermediate mounts ends up being unexported from the
> server. In a situation like this, we end up being unable to pathwalk
> down to the child mount of these unreachable dentries and can't unmount
> anything, even as root.

So umount -l the stuck sucker. What's the problem with that?

> 2/ disallow ->lookup operations: a umount is about removing an existing
> mount, so the dentries had better already be there.

That changes the semantics; as it is, you need exec permissions on the
entire path...

> Is this a terrible idea? Are there potentially problems with
> containerized setups if we were to do something like this? Are there
> better ways to solve this problem (and others like it)? Maybe this would
> be best done with a new UMOUNT_CACHED flag for umount2()?

We already have lazy umount. And what will you do to symlinks you run
into along the way? They *are* traversed; requiring the caller to
canonicalize them will only shift the problem to userland...

2023-04-14 02:44:14

by Al Viro

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:

> The path name that appears in /proc/mounts is the key that must be used
> to find and unmount a filesystem. When you do that "find"ing you are
> not looking up a name in a filesystem, you are looking up a key in the
> mount table.

No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
attempt to describe the mountpoint. Pathname resolution does not work
in terms of "the longest prefix is found and we handle the rest within
that filesystem".

> We could, instead, create an api that is given a mount-id (first number
> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
> read /proc/self/mountinfo, find the mount-id, and unmount it - all
> without ever doing path name lookup in the traditional sense.
>
> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
> revalidates, at most performs simple permission checks based on cached
> content.

umount /proc/self/fd/42/barf/something

Discuss.

OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
what would the right permissions be for it.

But please, lose the "mount table is a mapping from path prefix to filesystem"
notion - it really, really is not. IIRC, there are systems that work that way,
but it's nowhere near the semantics used by any Unices, all variants of Linux
included.

2023-04-14 03:30:02

by Trond Myklebust

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk



> On Apr 13, 2023, at 22:43, Al Viro <[email protected]> wrote:
>
> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
>
>> The path name that appears in /proc/mounts is the key that must be used
>> to find and unmount a filesystem. When you do that "find"ing you are
>> not looking up a name in a filesystem, you are looking up a key in the
>> mount table.
>
> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
> attempt to describe the mountpoint. Pathname resolution does not work
> in terms of "the longest prefix is found and we handle the rest within
> that filesystem".
>
>> We could, instead, create an api that is given a mount-id (first number
>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
>> read /proc/self/mountinfo, find the mount-id, and unmount it - all
>> without ever doing path name lookup in the traditional sense.
>>
>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
>> revalidates, at most performs simple permission checks based on cached
>> content.
>
> umount /proc/self/fd/42/barf/something
>
> Discuss.
>
> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
> what would the right permissions be for it.
>
> But please, lose the "mount table is a mapping from path prefix to filesystem"
> notion - it really, really is not. IIRC, there are systems that work that way,
> but it's nowhere near the semantics used by any Unices, all variants of Linux
> included.

We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.


_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-04-14 04:08:55

by Al Viro

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:

> We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.

You can already do umount -l /proc/self/fd/69 if you have a descriptor.
Converting mount-id to O_PATH... might be an interesting idea.

2023-04-14 04:09:39

by Trond Myklebust

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk



> On Apr 13, 2023, at 23:51, Al Viro <[email protected]> wrote:
>
> On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
>
>> We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
>> Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
>
> You can already do umount -l /proc/self/fd/69 if you have a descriptor.
> Converting mount-id to O_PATH... might be an interesting idea.

A dedicated umountat() might avoid the need for the lazy flag, if it were allowed to close the descriptor on success for the special case of an empty path.

Looking more closely, it would seem that CAP_DAC_READ_SEARCH might be a sufficient privilege requirement for the mount-id -> O_PATH syscall.

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-04-14 04:25:25

by Al Viro

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 04:06:03AM +0000, Trond Myklebust wrote:
>
>
> > On Apr 13, 2023, at 23:51, Al Viro <[email protected]> wrote:
> >
> > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
> >
> >> We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> >> Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
> >
> > You can already do umount -l /proc/self/fd/69 if you have a descriptor.
> > Converting mount-id to O_PATH... might be an interesting idea.
>
> A dedicated umountat() might avoid the need for the lazy flag, if it were allowed to close the descriptor on success for the special case of an empty path.

No. It's a wrong abstraction layer, anyway - "close the descriptor"
!= "make the opened file close", nevermind that it's a very odd
corner case that will cause a lot of headache down the road.

2023-04-14 09:53:44

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote:
> On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
>
> > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
>
> You can already do umount -l /proc/self/fd/69 if you have a descriptor.

Way back when we put together stuff for [2] we had umountat() as an item
but decided against it because it's mostely useful when used as AT_EMPTY_PATH.

umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the
path and you need to resolve it with lookup restrictions. Then path
resolution restrictions of openat2() can be used to get an fd. Which can
be passed to umount().

I need to step outside so this is a halfway-out-the-door thought but
given your description of the problem Jeff, why doesn't the following
work (Just sketching this, you can't call openat2() like that.):

fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED)
umount("/proc/self/fd/fd_mnt", MNT_DETACH)

> Converting mount-id to O_PATH... might be an interesting idea.

I think using mount-ids would be nice and fwiw, something we considered
as an alternative to umountat(). Not just can they be gotten from
/proc/<pid>/mountinfo but we also do expose the mount id to userspace
nowadays through:

STATX_MNT_ID
__u64 stx_mnt_id;

which also came out of [2]. And it should be safe to do via
AT_STATX_DONT_SYNC:

statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC)

using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then
pass that mount-id to the new system call.

[2]: https://github.com/uapi-group/kernel-features

2023-04-14 10:02:45

by Jeff Layton

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, 2023-04-14 at 03:32 +0100, Al Viro wrote:
> On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote:
>
> > It describes a situation where there are nested NFS mounts on a client,
> > and one of the intermediate mounts ends up being unexported from the
> > server. In a situation like this, we end up being unable to pathwalk
> > down to the child mount of these unreachable dentries and can't unmount
> > anything, even as root.
>
> So umount -l the stuck sucker. What's the problem with that?
>

You mean lazy umount the parent that is stuck? What happens to the child
mount in that case? Is it also eventually cleaned up?

> > 2/ disallow ->lookup operations: a umount is about removing an existing
> > mount, so the dentries had better already be there.
>
> That changes the semantics; as it is, you need exec permissions on the
> entire path...
>

Yep. But, I think it makes some sense to do so in the context of a
umount. Mostly, umount is done by a privileged user anyway so avoiding
permission checks isn't too great a stretch, IMO.

> > Is this a terrible idea? Are there potentially problems with
> > containerized setups if we were to do something like this? Are there
> > better ways to solve this problem (and others like it)? Maybe this would
> > be best done with a new UMOUNT_CACHED flag for umount2()?
>
> We already have lazy umount. And what will you do to symlinks you run
> into along the way? They *are* traversed; requiring the caller to
> canonicalize them will only shift the problem to userland...

Yeah, I hadn't considered symlinks here. Still, if we have a cached
symlink dentry, wouldn't we also already have the symlink target in
cache too? Or is that not guaranteed?

--
Jeff Layton <[email protected]>

2023-04-14 10:12:44

by Jeff Layton

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
>
> > The path name that appears in /proc/mounts is the key that must be used
> > to find and unmount a filesystem. When you do that "find"ing you are
> > not looking up a name in a filesystem, you are looking up a key in the
> > mount table.
>
> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
> attempt to describe the mountpoint. Pathname resolution does not work
> in terms of "the longest prefix is found and we handle the rest within
> that filesystem".
>
> > We could, instead, create an api that is given a mount-id (first number
> > in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
> > read /proc/self/mountinfo, find the mount-id, and unmount it - all
> > without ever doing path name lookup in the traditional sense.
> >
> > But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
> > LOOKUP_CACHED, and it only finds paths that are in the dcache, never
> > revalidates, at most performs simple permission checks based on cached
> > content.
>
> umount /proc/self/fd/42/barf/something
>

Does any of that involve talking to the server? I don't necessarily see
a problem with doing the above. If "something" is in cache then that
should still work.

The main idea here is that we want to avoid communicating with the
backing store during the umount(2) pathwalk.

> Discuss.
>
> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
> what would the right permissions be for it.
>
> But please, lose the "mount table is a mapping from path prefix to filesystem"
> notion - it really, really is not. IIRC, there are systems that work that way,
> but it's nowhere near the semantics used by any Unices, all variants of Linux
> included.

I'm not opposed to something by umount-by-mount-id either. All of this
seems like something that should probably rely on CAP_SYS_ADMIN.
--
Jeff Layton <[email protected]>

2023-04-14 10:14:46

by Jeff Layton

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote:
> On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote:
> > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
> >
> > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
> >
> > You can already do umount -l /proc/self/fd/69 if you have a descriptor.
>
> Way back when we put together stuff for [2] we had umountat() as an item
> but decided against it because it's mostely useful when used as AT_EMPTY_PATH.
>
> umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the
> path and you need to resolve it with lookup restrictions. Then path
> resolution restrictions of openat2() can be used to get an fd. Which can
> be passed to umount().
>
> I need to step outside so this is a halfway-out-the-door thought but
> given your description of the problem Jeff, why doesn't the following
> work (Just sketching this, you can't call openat2() like that.):
>
> fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED)
> umount("/proc/self/fd/fd_mnt", MNT_DETACH)

Something like that might work. A RESOLVE_CACHED flag is something that
would involve more than just umount(2) though. That said, it could be
useful in other situations.

>
> > Converting mount-id to O_PATH... might be an interesting idea.
>
> I think using mount-ids would be nice and fwiw, something we considered
> as an alternative to umountat(). Not just can they be gotten from
> /proc/<pid>/mountinfo but we also do expose the mount id to userspace
> nowadays through:
>
> STATX_MNT_ID
> __u64 stx_mnt_id;
>
> which also came out of [2]. And it should be safe to do via
> AT_STATX_DONT_SYNC:
>
> statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC)
>
> using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then
> pass that mount-id to the new system call.
>
> [2]: https://github.com/uapi-group/kernel-features

This is generating a lot of good ideas! Maybe we should plan to discuss
this further at LSF/MM?

--
Jeff Layton <[email protected]>

2023-04-14 11:28:45

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 06:09:46AM -0400, Jeff Layton wrote:
> On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote:
> > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote:
> > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
> > >
> > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
> > >
> > > You can already do umount -l /proc/self/fd/69 if you have a descriptor.
> >
> > Way back when we put together stuff for [2] we had umountat() as an item
> > but decided against it because it's mostely useful when used as AT_EMPTY_PATH.
> >
> > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the
> > path and you need to resolve it with lookup restrictions. Then path
> > resolution restrictions of openat2() can be used to get an fd. Which can
> > be passed to umount().
> >
> > I need to step outside so this is a halfway-out-the-door thought but
> > given your description of the problem Jeff, why doesn't the following
> > work (Just sketching this, you can't call openat2() like that.):
> >
> > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED)
> > umount("/proc/self/fd/fd_mnt", MNT_DETACH)
>
> Something like that might work. A RESOLVE_CACHED flag is something that
> would involve more than just umount(2) though. That said, it could be
> useful in other situations.

I think I introduced an ambiguity by accident. What I meant by "you
can't call openat2() like that" is that it takes a struct open_how
argument not just a simple flags argument.

The flag I was talking about, RESOLVE_CACHED, does exist already. So it
is already possible to use openat2() to resolve paths like that. See
include/uapi/linux/openat2.h

>
> >
> > > Converting mount-id to O_PATH... might be an interesting idea.
> >
> > I think using mount-ids would be nice and fwiw, something we considered
> > as an alternative to umountat(). Not just can they be gotten from
> > /proc/<pid>/mountinfo but we also do expose the mount id to userspace
> > nowadays through:
> >
> > STATX_MNT_ID
> > __u64 stx_mnt_id;
> >
> > which also came out of [2]. And it should be safe to do via
> > AT_STATX_DONT_SYNC:
> >
> > statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC)
> >
> > using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then
> > pass that mount-id to the new system call.
> >
> > [2]: https://github.com/uapi-group/kernel-features
>
> This is generating a lot of good ideas! Maybe we should plan to discuss
> this further at LSF/MM?

Sure, happy to.

2023-04-14 12:20:42

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 06:01:59AM -0400, Jeff Layton wrote:
> On Fri, 2023-04-14 at 03:32 +0100, Al Viro wrote:
> > On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote:
> >
> > > It describes a situation where there are nested NFS mounts on a client,
> > > and one of the intermediate mounts ends up being unexported from the
> > > server. In a situation like this, we end up being unable to pathwalk
> > > down to the child mount of these unreachable dentries and can't unmount
> > > anything, even as root.
> >
> > So umount -l the stuck sucker. What's the problem with that?
> >
>
> You mean lazy umount the parent that is stuck? What happens to the child
> mount in that case? Is it also eventually cleaned up?

I hope it's ok I barge in to answer this but due to the mount beneath
patches I was working on I did spend even more time in that code then I
already did. So this is good chance to get yelled at if I analyzed these
codepaths wrong.

The child mount would be unmounted in that case. umount_tree() is what
you want to be looking at.

If you perform a regular umount _without_ MNT_DETACH you can see that
umount_tree() is effectively guarded by a call to propagate_mount_busy().
It checks wether the direct umount target has any child mounts and if so
refuses the umount with EBUSY:

mkdir -p /mnt/a/b /mnt/c /mnt/d

# Create parent mount of a@c
mount --bind /mnt/a /mnt/c

# create child d@b which as child mount of a@c
mount --bind /mnt/d /mnt/c/b

If you call umount /mnt/c it will fail because a@c has child mounts.
If you do a lazy umount via MNT_DETACH through umount -l /mnt/c then it
will also unmount all children of a@c. In fact it will even include
children of children...

mkdir /mnt/c/b/e
mount --bind /mnt/a/b/ /mnt/c/b/e
umount -l /mnt/c

That's basically what the next_mnt() loop at the beginning of
umount_tree() is doing where it collects all direct targets to umount.

However, if mount propagation is in play things get a lot nastier as you
can fail a non-MNT_DETACH umount because of it as well (Note that umount
propagation is always triggered if the parent mount of your direct
umount target is a shared mount. IOW, you can't easily opt out of it
unless you make the parent mount of your immediate umount target a
non-shared mount.).

A trivial reason that comes to mind where you would fail the umount due
to mount propagation would where a propagated mount is kept busy and not
the original mount. So similar to above on the host do:

mkdir -p /mnt/a/b /mnt/c /mnt/d
mount --bind /mnt/a /mnt/c
umount /mnt/c

and you would expect the umount /mnt/c to work. But you realize it fails
with EBUSY but noone is referencing that mount anymore at least not in
an obvious way.

But assume someone had a mount namespace open that receives mount
propagation from /. In that case the a@c mount would have propagated
into that mount namespace. So someone could've cd /mnt/c into that
propagated mount and the umount /mnt/c would fail.

In that case propagate_mount_busy() would detect the increased refcount
when it tries to check whether the umount could be propagated and give
you EBUSY. So here you also need a lazy umount to get rid of that
mount... And there are other nice scenarios where that's hard to figure
out.

2023-04-14 12:42:03

by Jeff Layton

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, 2023-04-14 at 13:16 +0200, Christian Brauner wrote:
> On Fri, Apr 14, 2023 at 06:09:46AM -0400, Jeff Layton wrote:
> > On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote:
> > > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote:
> > > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
> > > >
> > > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> > > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
> > > >
> > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor.
> > >
> > > Way back when we put together stuff for [2] we had umountat() as an item
> > > but decided against it because it's mostely useful when used as AT_EMPTY_PATH.
> > >
> > > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the
> > > path and you need to resolve it with lookup restrictions. Then path
> > > resolution restrictions of openat2() can be used to get an fd. Which can
> > > be passed to umount().
> > >
> > > I need to step outside so this is a halfway-out-the-door thought but
> > > given your description of the problem Jeff, why doesn't the following
> > > work (Just sketching this, you can't call openat2() like that.):
> > >
> > > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED)
> > > umount("/proc/self/fd/fd_mnt", MNT_DETACH)
> >
> > Something like that might work. A RESOLVE_CACHED flag is something that
> > would involve more than just umount(2) though. That said, it could be
> > useful in other situations.
>
> I think I introduced an ambiguity by accident. What I meant by "you
> can't call openat2() like that" is that it takes a struct open_how
> argument not just a simple flags argument.
>
> The flag I was talking about, RESOLVE_CACHED, does exist already. So it
> is already possible to use openat2() to resolve paths like that. See
> include/uapi/linux/openat2.h
>

Oh, thanks! I haven't tried this yet, but I doubt it'd fix the problem
David was reproducing. There, the problem is mostly permission checks
during the pathwalk. It tries to contact the server to check the
permission, but because it's unexported, EACCES bubbles up.

Also, to follow up: I tried Al's suggestion of lazily unmounting the bad
intermediate mount closest to the root, and it does seem to clean up
child mounts as expected. So, it does look like there is recourse in
current kernels, even if the method is a bit unintuitive.

Still, it might be worth discussing at LSF or here on the list. It'd be
nice if umount "just worked" in this situation. Is there any reason to
do a full-blown d_revalidating and permission-checking pathwalk on
umount, assuming the caller already has CAP_SYS_ADMIN or similar?

--
Jeff Layton <[email protected]>

2023-04-14 12:55:24

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 08:33:09AM -0400, Jeff Layton wrote:
> On Fri, 2023-04-14 at 13:16 +0200, Christian Brauner wrote:
> > On Fri, Apr 14, 2023 at 06:09:46AM -0400, Jeff Layton wrote:
> > > On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote:
> > > > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote:
> > > > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
> > > > >
> > > > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> > > > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
> > > > >
> > > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor.
> > > >
> > > > Way back when we put together stuff for [2] we had umountat() as an item
> > > > but decided against it because it's mostely useful when used as AT_EMPTY_PATH.
> > > >
> > > > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the
> > > > path and you need to resolve it with lookup restrictions. Then path
> > > > resolution restrictions of openat2() can be used to get an fd. Which can
> > > > be passed to umount().
> > > >
> > > > I need to step outside so this is a halfway-out-the-door thought but
> > > > given your description of the problem Jeff, why doesn't the following
> > > > work (Just sketching this, you can't call openat2() like that.):
> > > >
> > > > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED)
> > > > umount("/proc/self/fd/fd_mnt", MNT_DETACH)
> > >
> > > Something like that might work. A RESOLVE_CACHED flag is something that
> > > would involve more than just umount(2) though. That said, it could be
> > > useful in other situations.
> >
> > I think I introduced an ambiguity by accident. What I meant by "you
> > can't call openat2() like that" is that it takes a struct open_how
> > argument not just a simple flags argument.
> >
> > The flag I was talking about, RESOLVE_CACHED, does exist already. So it
> > is already possible to use openat2() to resolve paths like that. See
> > include/uapi/linux/openat2.h
> >
>
> Oh, thanks! I haven't tried this yet, but I doubt it'd fix the problem
> David was reproducing. There, the problem is mostly permission checks
> during the pathwalk. It tries to contact the server to check the
> permission, but because it's unexported, EACCES bubbles up.
>
> Also, to follow up: I tried Al's suggestion of lazily unmounting the bad
> intermediate mount closest to the root, and it does seem to clean up
> child mounts as expected. So, it does look like there is recourse in
> current kernels, even if the method is a bit unintuitive.

See my other mail about this.

Lazy umount is often the way to go. In fact, there'll be mount trees
that you cannot unmount without MNT_DETACH because of mount propagation.

>
> Still, it might be worth discussing at LSF or here on the list. It'd be
> nice if umount "just worked" in this situation. Is there any reason to
> do a full-blown d_revalidating and permission-checking pathwalk on
> umount, assuming the caller already has CAP_SYS_ADMIN or similar?

I find this to be conceptually the wrong way of doing this. Because even
if you can piggy-back on LOOKUP_CACHED you probably still need special
cases handling for umount. And at that point I think a umount by mnt-id
is just nicer. Because you would also need to expose the new umount()
behavior under a new flag otherwise you risk breaking userspace
assumptions.

2023-04-14 13:20:48

by David Wysochanski

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Thu, Apr 13, 2023 at 10:34 PM Al Viro <[email protected]> wrote:
>
> On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote:
>
> > It describes a situation where there are nested NFS mounts on a client,
> > and one of the intermediate mounts ends up being unexported from the
> > server. In a situation like this, we end up being unable to pathwalk
> > down to the child mount of these unreachable dentries and can't unmount
> > anything, even as root.
>
> So umount -l the stuck sucker. What's the problem with that?
>

Speaking in the context of the original reproducer:
This does work if you "umount -l /A" - the underlying "/A" mount,
but not if you do "umount -l /A/B" (both have lost their access on
the server).

The problem with this is IMO, it is not very intuitive to lazy unmount
the root of a whole tree like that. Also, the customer's case was autofs,
and I don't think autofs does this, it tries to umount the children
first and gets stuck there in this scenario.

But overall yes it does make sense, IMO.

> > 2/ disallow ->lookup operations: a umount is about removing an existing
> > mount, so the dentries had better already be there.
>
> That changes the semantics; as it is, you need exec permissions on the
> entire path...
>
> > Is this a terrible idea? Are there potentially problems with
> > containerized setups if we were to do something like this? Are there
> > better ways to solve this problem (and others like it)? Maybe this would
> > be best done with a new UMOUNT_CACHED flag for umount2()?
>
> We already have lazy umount. And what will you do to symlinks you run
> into along the way? They *are* traversed; requiring the caller to
> canonicalize them will only shift the problem to userland...
>

2023-04-14 13:45:51

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote:
> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
> > On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
> >
> > > The path name that appears in /proc/mounts is the key that must be used
> > > to find and unmount a filesystem. When you do that "find"ing you are
> > > not looking up a name in a filesystem, you are looking up a key in the
> > > mount table.
> >
> > No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
> > attempt to describe the mountpoint. Pathname resolution does not work
> > in terms of "the longest prefix is found and we handle the rest within
> > that filesystem".
> >
> > > We could, instead, create an api that is given a mount-id (first number
> > > in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
> > > read /proc/self/mountinfo, find the mount-id, and unmount it - all
> > > without ever doing path name lookup in the traditional sense.
> > >
> > > But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
> > > LOOKUP_CACHED, and it only finds paths that are in the dcache, never
> > > revalidates, at most performs simple permission checks based on cached
> > > content.
> >
> > umount /proc/self/fd/42/barf/something
> >
>
> Does any of that involve talking to the server? I don't necessarily see
> a problem with doing the above. If "something" is in cache then that
> should still work.
>
> The main idea here is that we want to avoid communicating with the
> backing store during the umount(2) pathwalk.
>
> > Discuss.
> >
> > OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
> > what would the right permissions be for it.
> >
> > But please, lose the "mount table is a mapping from path prefix to filesystem"
> > notion - it really, really is not. IIRC, there are systems that work that way,
> > but it's nowhere near the semantics used by any Unices, all variants of Linux
> > included.
>
> I'm not opposed to something by umount-by-mount-id either. All of this
> seems like something that should probably rely on CAP_SYS_ADMIN.

The permission model needs to account for the fact that mount ids are
global and as such you could in principle unmount any mount in any mount
namespace. IOW, you can circumvent lookup restrictions completely.

So we could resolve the mnt-id to an FMODE_PATH and then very roughly
with no claim to solving everything:

may_umount_by_mnt_id(struct path *opath)
{
struct path root;
bool reachable;

// caller in principle able to circumvent lookup restrictions
if (!may_cap_dac_readsearch())
return false;

// caller can mount in their mountns
if (!may_mount())
return false;

// target mount and caller in the same mountns
if (!check_mnt())
return false;

// caller could in principle reach mount from it's root
get_fs_root(current->fs, &root);
reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root);
path_put(&root);

return reachable;
}

However, that still means that we have laxer restrictions on unmounting
by mount-id then on unmount with lookup as for lookup just having
CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems
without custom permission handlers - we also establish that the inode
can be mapped into the caller's idmapping.

So that would mean that unmounting by mount-id would allow you to
unmount mounts in cases where you wouldn't with umount. That might be
fine though as that's ultimately the goal here in a way.

One could also see a very useful feature in this where you require
capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow
unmounting any mount in the system by mount-id. This would obviously be
very useful for privileged service managers but I haven't thought this
through.

2023-04-14 14:27:24

by Trond Myklebust

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk



> On Apr 14, 2023, at 09:41, Christian Brauner <[email protected]> wrote:
>
> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote:
>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
>>>
>>>> The path name that appears in /proc/mounts is the key that must be used
>>>> to find and unmount a filesystem. When you do that "find"ing you are
>>>> not looking up a name in a filesystem, you are looking up a key in the
>>>> mount table.
>>>
>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
>>> attempt to describe the mountpoint. Pathname resolution does not work
>>> in terms of "the longest prefix is found and we handle the rest within
>>> that filesystem".
>>>
>>>> We could, instead, create an api that is given a mount-id (first number
>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all
>>>> without ever doing path name lookup in the traditional sense.
>>>>
>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
>>>> revalidates, at most performs simple permission checks based on cached
>>>> content.
>>>
>>> umount /proc/self/fd/42/barf/something
>>>
>>
>> Does any of that involve talking to the server? I don't necessarily see
>> a problem with doing the above. If "something" is in cache then that
>> should still work.
>>
>> The main idea here is that we want to avoid communicating with the
>> backing store during the umount(2) pathwalk.
>>
>>> Discuss.
>>>
>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
>>> what would the right permissions be for it.
>>>
>>> But please, lose the "mount table is a mapping from path prefix to filesystem"
>>> notion - it really, really is not. IIRC, there are systems that work that way,
>>> but it's nowhere near the semantics used by any Unices, all variants of Linux
>>> included.
>>
>> I'm not opposed to something by umount-by-mount-id either. All of this
>> seems like something that should probably rely on CAP_SYS_ADMIN.
>
> The permission model needs to account for the fact that mount ids are
> global and as such you could in principle unmount any mount in any mount
> namespace. IOW, you can circumvent lookup restrictions completely.
>
> So we could resolve the mnt-id to an FMODE_PATH and then very roughly
> with no claim to solving everything:
>
> may_umount_by_mnt_id(struct path *opath)
> {
> struct path root;
> bool reachable;
>
> // caller in principle able to circumvent lookup restrictions
> if (!may_cap_dac_readsearch())
> return false;
>
> // caller can mount in their mountns
> if (!may_mount())
> return false;
>
> // target mount and caller in the same mountns
> if (!check_mnt())
> return false;
>
> // caller could in principle reach mount from it's root
> get_fs_root(current->fs, &root);
> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root);
> path_put(&root);
>
> return reachable;
> }
>
> However, that still means that we have laxer restrictions on unmounting
> by mount-id then on unmount with lookup as for lookup just having
> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems
> without custom permission handlers - we also establish that the inode
> can be mapped into the caller's idmapping.
>
> So that would mean that unmounting by mount-id would allow you to
> unmount mounts in cases where you wouldn't with umount. That might be
> fine though as that's ultimately the goal here in a way.
>
> One could also see a very useful feature in this where you require
> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow
> unmounting any mount in the system by mount-id. This would obviously be
> very useful for privileged service managers but I haven't thought this
> Through.

That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege.

The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege.

Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN.

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-04-14 14:59:39

by Al Viro

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 06:01:59AM -0400, Jeff Layton wrote:
> On Fri, 2023-04-14 at 03:32 +0100, Al Viro wrote:
> > On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote:
> >
> > > It describes a situation where there are nested NFS mounts on a client,
> > > and one of the intermediate mounts ends up being unexported from the
> > > server. In a situation like this, we end up being unable to pathwalk
> > > down to the child mount of these unreachable dentries and can't unmount
> > > anything, even as root.
> >
> > So umount -l the stuck sucker. What's the problem with that?
> >
>
> You mean lazy umount the parent that is stuck? What happens to the child
> mount in that case? Is it also eventually cleaned up?

Yes. Lazy umount takes out the contributions to refcount due to being
present in mount tree; as soon as other references go away (opened
files, current directories, in-progress syscalls on files in there) struct
mount instance releases the active reference to filesystem instance
(struct super_block) and goes away. Once all active references to
filesystem instance are gone, it gets shut down.

> Yeah, I hadn't considered symlinks here. Still, if we have a cached
> symlink dentry, wouldn't we also already have the symlink target in
> cache too? Or is that not guaranteed?

Not at all. What's more, why would we have that symlink dentry cached
in the first place? If you suddenly impose that kind of restrictions
on umount(2), you are pretty much guaranteed to break userland...

Symlink dentries are rarely pinned, BTW - they may very well be
cached if we hit them often enough, but outside of the things
like lchown()/lstat() or pathname resolution in progress that is
currently traversing that symlink... refcount is going to be zero.

2023-04-14 15:15:43

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote:
>
>
> > On Apr 14, 2023, at 09:41, Christian Brauner <[email protected]> wrote:
> >
> > On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote:
> >> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
> >>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
> >>>
> >>>> The path name that appears in /proc/mounts is the key that must be used
> >>>> to find and unmount a filesystem. When you do that "find"ing you are
> >>>> not looking up a name in a filesystem, you are looking up a key in the
> >>>> mount table.
> >>>
> >>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
> >>> attempt to describe the mountpoint. Pathname resolution does not work
> >>> in terms of "the longest prefix is found and we handle the rest within
> >>> that filesystem".
> >>>
> >>>> We could, instead, create an api that is given a mount-id (first number
> >>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
> >>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all
> >>>> without ever doing path name lookup in the traditional sense.
> >>>>
> >>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
> >>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
> >>>> revalidates, at most performs simple permission checks based on cached
> >>>> content.
> >>>
> >>> umount /proc/self/fd/42/barf/something
> >>>
> >>
> >> Does any of that involve talking to the server? I don't necessarily see
> >> a problem with doing the above. If "something" is in cache then that
> >> should still work.
> >>
> >> The main idea here is that we want to avoid communicating with the
> >> backing store during the umount(2) pathwalk.
> >>
> >>> Discuss.
> >>>
> >>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
> >>> what would the right permissions be for it.
> >>>
> >>> But please, lose the "mount table is a mapping from path prefix to filesystem"
> >>> notion - it really, really is not. IIRC, there are systems that work that way,
> >>> but it's nowhere near the semantics used by any Unices, all variants of Linux
> >>> included.
> >>
> >> I'm not opposed to something by umount-by-mount-id either. All of this
> >> seems like something that should probably rely on CAP_SYS_ADMIN.
> >
> > The permission model needs to account for the fact that mount ids are
> > global and as such you could in principle unmount any mount in any mount
> > namespace. IOW, you can circumvent lookup restrictions completely.
> >
> > So we could resolve the mnt-id to an FMODE_PATH and then very roughly
> > with no claim to solving everything:
> >
> > may_umount_by_mnt_id(struct path *opath)
> > {
> > struct path root;
> > bool reachable;
> >
> > // caller in principle able to circumvent lookup restrictions
> > if (!may_cap_dac_readsearch())
> > return false;
> >
> > // caller can mount in their mountns
> > if (!may_mount())
> > return false;
> >
> > // target mount and caller in the same mountns
> > if (!check_mnt())
> > return false;
> >
> > // caller could in principle reach mount from it's root
> > get_fs_root(current->fs, &root);
> > reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root);
> > path_put(&root);
> >
> > return reachable;
> > }
> >
> > However, that still means that we have laxer restrictions on unmounting
> > by mount-id then on unmount with lookup as for lookup just having
> > CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems
> > without custom permission handlers - we also establish that the inode
> > can be mapped into the caller's idmapping.
> >
> > So that would mean that unmounting by mount-id would allow you to
> > unmount mounts in cases where you wouldn't with umount. That might be
> > fine though as that's ultimately the goal here in a way.
> >
> > One could also see a very useful feature in this where you require
> > capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow
> > unmounting any mount in the system by mount-id. This would obviously be
> > very useful for privileged service managers but I haven't thought this
> > Through.
>
> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege.
>
> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege.
>
> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN.

There's a difference between unmounting directly by providing a mount id
and getting an O_PATH file descriptor from a mnt-id. If you can simply
unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH
in a container. Without it you likely need to require
capable(CAP_DAC_READ_SEARCH) aka system level privileges just like
open_to_handle_at() which makes this interface way less generic and
usable. Otherwise you'd be able to get an O_PATH fd to something that
you wouldn't be able to access through normal path lookup.

2023-04-14 15:32:06

by Trond Myklebust

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk



> On Apr 14, 2023, at 11:13, Christian Brauner <[email protected]> wrote:
>
> On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote:
>>
>>
>>> On Apr 14, 2023, at 09:41, Christian Brauner <[email protected]> wrote:
>>>
>>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote:
>>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
>>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
>>>>>
>>>>>> The path name that appears in /proc/mounts is the key that must be used
>>>>>> to find and unmount a filesystem. When you do that "find"ing you are
>>>>>> not looking up a name in a filesystem, you are looking up a key in the
>>>>>> mount table.
>>>>>
>>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
>>>>> attempt to describe the mountpoint. Pathname resolution does not work
>>>>> in terms of "the longest prefix is found and we handle the rest within
>>>>> that filesystem".
>>>>>
>>>>>> We could, instead, create an api that is given a mount-id (first number
>>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
>>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all
>>>>>> without ever doing path name lookup in the traditional sense.
>>>>>>
>>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
>>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
>>>>>> revalidates, at most performs simple permission checks based on cached
>>>>>> content.
>>>>>
>>>>> umount /proc/self/fd/42/barf/something
>>>>>
>>>>
>>>> Does any of that involve talking to the server? I don't necessarily see
>>>> a problem with doing the above. If "something" is in cache then that
>>>> should still work.
>>>>
>>>> The main idea here is that we want to avoid communicating with the
>>>> backing store during the umount(2) pathwalk.
>>>>
>>>>> Discuss.
>>>>>
>>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
>>>>> what would the right permissions be for it.
>>>>>
>>>>> But please, lose the "mount table is a mapping from path prefix to filesystem"
>>>>> notion - it really, really is not. IIRC, there are systems that work that way,
>>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux
>>>>> included.
>>>>
>>>> I'm not opposed to something by umount-by-mount-id either. All of this
>>>> seems like something that should probably rely on CAP_SYS_ADMIN.
>>>
>>> The permission model needs to account for the fact that mount ids are
>>> global and as such you could in principle unmount any mount in any mount
>>> namespace. IOW, you can circumvent lookup restrictions completely.
>>>
>>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly
>>> with no claim to solving everything:
>>>
>>> may_umount_by_mnt_id(struct path *opath)
>>> {
>>> struct path root;
>>> bool reachable;
>>>
>>> // caller in principle able to circumvent lookup restrictions
>>> if (!may_cap_dac_readsearch())
>>> return false;
>>>
>>> // caller can mount in their mountns
>>> if (!may_mount())
>>> return false;
>>>
>>> // target mount and caller in the same mountns
>>> if (!check_mnt())
>>> return false;
>>>
>>> // caller could in principle reach mount from it's root
>>> get_fs_root(current->fs, &root);
>>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root);
>>> path_put(&root);
>>>
>>> return reachable;
>>> }
>>>
>>> However, that still means that we have laxer restrictions on unmounting
>>> by mount-id then on unmount with lookup as for lookup just having
>>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems
>>> without custom permission handlers - we also establish that the inode
>>> can be mapped into the caller's idmapping.
>>>
>>> So that would mean that unmounting by mount-id would allow you to
>>> unmount mounts in cases where you wouldn't with umount. That might be
>>> fine though as that's ultimately the goal here in a way.
>>>
>>> One could also see a very useful feature in this where you require
>>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow
>>> unmounting any mount in the system by mount-id. This would obviously be
>>> very useful for privileged service managers but I haven't thought this
>>> Through.
>>
>> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege.
>>
>> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege.
>>
>> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN.
>
> There's a difference between unmounting directly by providing a mount id
> and getting an O_PATH file descriptor from a mnt-id. If you can simply
> unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH
> in a container. Without it you likely need to require
> capable(CAP_DAC_READ_SEARCH) aka system level privileges just like
> open_to_handle_at() which makes this interface way less generic and
> usable. Otherwise you'd be able to get an O_PATH fd to something that
> you wouldn't be able to access through normal path lookup.


Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it).

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-04-14 16:04:25

by Trond Myklebust

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk


> On Apr 14, 2023, at 11:30, Trond Myklebust <[email protected]> wrote:
>
>
>
>> On Apr 14, 2023, at 11:13, Christian Brauner <[email protected]> wrote:
>>
>> On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote:
>>>
>>>
>>>> On Apr 14, 2023, at 09:41, Christian Brauner <[email protected]> wrote:
>>>>
>>>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote:
>>>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
>>>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
>>>>>>
>>>>>>> The path name that appears in /proc/mounts is the key that must be used
>>>>>>> to find and unmount a filesystem. When you do that "find"ing you are
>>>>>>> not looking up a name in a filesystem, you are looking up a key in the
>>>>>>> mount table.
>>>>>>
>>>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
>>>>>> attempt to describe the mountpoint. Pathname resolution does not work
>>>>>> in terms of "the longest prefix is found and we handle the rest within
>>>>>> that filesystem".
>>>>>>
>>>>>>> We could, instead, create an api that is given a mount-id (first number
>>>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
>>>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all
>>>>>>> without ever doing path name lookup in the traditional sense.
>>>>>>>
>>>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
>>>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
>>>>>>> revalidates, at most performs simple permission checks based on cached
>>>>>>> content.
>>>>>>
>>>>>> umount /proc/self/fd/42/barf/something
>>>>>>
>>>>>
>>>>> Does any of that involve talking to the server? I don't necessarily see
>>>>> a problem with doing the above. If "something" is in cache then that
>>>>> should still work.
>>>>>
>>>>> The main idea here is that we want to avoid communicating with the
>>>>> backing store during the umount(2) pathwalk.
>>>>>
>>>>>> Discuss.
>>>>>>
>>>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
>>>>>> what would the right permissions be for it.
>>>>>>
>>>>>> But please, lose the "mount table is a mapping from path prefix to filesystem"
>>>>>> notion - it really, really is not. IIRC, there are systems that work that way,
>>>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux
>>>>>> included.
>>>>>
>>>>> I'm not opposed to something by umount-by-mount-id either. All of this
>>>>> seems like something that should probably rely on CAP_SYS_ADMIN.
>>>>
>>>> The permission model needs to account for the fact that mount ids are
>>>> global and as such you could in principle unmount any mount in any mount
>>>> namespace. IOW, you can circumvent lookup restrictions completely.
>>>>
>>>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly
>>>> with no claim to solving everything:
>>>>
>>>> may_umount_by_mnt_id(struct path *opath)
>>>> {
>>>> struct path root;
>>>> bool reachable;
>>>>
>>>> // caller in principle able to circumvent lookup restrictions
>>>> if (!may_cap_dac_readsearch())
>>>> return false;
>>>>
>>>> // caller can mount in their mountns
>>>> if (!may_mount())
>>>> return false;
>>>>
>>>> // target mount and caller in the same mountns
>>>> if (!check_mnt())
>>>> return false;
>>>>
>>>> // caller could in principle reach mount from it's root
>>>> get_fs_root(current->fs, &root);
>>>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root);
>>>> path_put(&root);
>>>>
>>>> return reachable;
>>>> }
>>>>
>>>> However, that still means that we have laxer restrictions on unmounting
>>>> by mount-id then on unmount with lookup as for lookup just having
>>>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems
>>>> without custom permission handlers - we also establish that the inode
>>>> can be mapped into the caller's idmapping.
>>>>
>>>> So that would mean that unmounting by mount-id would allow you to
>>>> unmount mounts in cases where you wouldn't with umount. That might be
>>>> fine though as that's ultimately the goal here in a way.
>>>>
>>>> One could also see a very useful feature in this where you require
>>>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow
>>>> unmounting any mount in the system by mount-id. This would obviously be
>>>> very useful for privileged service managers but I haven't thought this
>>>> Through.
>>>
>>> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege.
>>>
>>> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege.
>>>
>>> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN.
>>
>> There's a difference between unmounting directly by providing a mount id
>> and getting an O_PATH file descriptor from a mnt-id. If you can simply
>> unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH
>> in a container. Without it you likely need to require
>> capable(CAP_DAC_READ_SEARCH) aka system level privileges just like
>> open_to_handle_at() which makes this interface way less generic and
>> usable. Otherwise you'd be able to get an O_PATH fd to something that
>> you wouldn't be able to access through normal path lookup.
>
>
> Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it).
>

One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever.

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-04-14 16:27:21

by Al Viro

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 03:57:34PM +0000, Trond Myklebust wrote:

> >
> > Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it).
> >
>
> One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever.
>

s/lazy/forced/, surely? Confused...

Note, BTW, that hard vs. soft is a property of fs instance; if you have
it present elsewhere in the mount tree, flipping it would affect all
such places. I don't see any good way to make it a per-mount thing, TBH...

2023-04-14 16:38:15

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 03:30:30PM +0000, Trond Myklebust wrote:
>
>
> > On Apr 14, 2023, at 11:13, Christian Brauner <[email protected]> wrote:
> >
> > On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote:
> >>
> >>
> >>> On Apr 14, 2023, at 09:41, Christian Brauner <[email protected]> wrote:
> >>>
> >>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote:
> >>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote:
> >>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote:
> >>>>>
> >>>>>> The path name that appears in /proc/mounts is the key that must be used
> >>>>>> to find and unmount a filesystem. When you do that "find"ing you are
> >>>>>> not looking up a name in a filesystem, you are looking up a key in the
> >>>>>> mount table.
> >>>>>
> >>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort
> >>>>> attempt to describe the mountpoint. Pathname resolution does not work
> >>>>> in terms of "the longest prefix is found and we handle the rest within
> >>>>> that filesystem".
> >>>>>
> >>>>>> We could, instead, create an api that is given a mount-id (first number
> >>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could
> >>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all
> >>>>>> without ever doing path name lookup in the traditional sense.
> >>>>>>
> >>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed
> >>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never
> >>>>>> revalidates, at most performs simple permission checks based on cached
> >>>>>> content.
> >>>>>
> >>>>> umount /proc/self/fd/42/barf/something
> >>>>>
> >>>>
> >>>> Does any of that involve talking to the server? I don't necessarily see
> >>>> a problem with doing the above. If "something" is in cache then that
> >>>> should still work.
> >>>>
> >>>> The main idea here is that we want to avoid communicating with the
> >>>> backing store during the umount(2) pathwalk.
> >>>>
> >>>>> Discuss.
> >>>>>
> >>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide
> >>>>> what would the right permissions be for it.
> >>>>>
> >>>>> But please, lose the "mount table is a mapping from path prefix to filesystem"
> >>>>> notion - it really, really is not. IIRC, there are systems that work that way,
> >>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux
> >>>>> included.
> >>>>
> >>>> I'm not opposed to something by umount-by-mount-id either. All of this
> >>>> seems like something that should probably rely on CAP_SYS_ADMIN.
> >>>
> >>> The permission model needs to account for the fact that mount ids are
> >>> global and as such you could in principle unmount any mount in any mount
> >>> namespace. IOW, you can circumvent lookup restrictions completely.
> >>>
> >>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly
> >>> with no claim to solving everything:
> >>>
> >>> may_umount_by_mnt_id(struct path *opath)
> >>> {
> >>> struct path root;
> >>> bool reachable;
> >>>
> >>> // caller in principle able to circumvent lookup restrictions
> >>> if (!may_cap_dac_readsearch())
> >>> return false;
> >>>
> >>> // caller can mount in their mountns
> >>> if (!may_mount())
> >>> return false;
> >>>
> >>> // target mount and caller in the same mountns
> >>> if (!check_mnt())
> >>> return false;
> >>>
> >>> // caller could in principle reach mount from it's root
> >>> get_fs_root(current->fs, &root);
> >>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root);
> >>> path_put(&root);
> >>>
> >>> return reachable;
> >>> }
> >>>
> >>> However, that still means that we have laxer restrictions on unmounting
> >>> by mount-id then on unmount with lookup as for lookup just having
> >>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems
> >>> without custom permission handlers - we also establish that the inode
> >>> can be mapped into the caller's idmapping.
> >>>
> >>> So that would mean that unmounting by mount-id would allow you to
> >>> unmount mounts in cases where you wouldn't with umount. That might be
> >>> fine though as that's ultimately the goal here in a way.
> >>>
> >>> One could also see a very useful feature in this where you require
> >>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow
> >>> unmounting any mount in the system by mount-id. This would obviously be
> >>> very useful for privileged service managers but I haven't thought this
> >>> Through.
> >>
> >> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege.
> >>
> >> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege.
> >>
> >> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN.
> >
> > There's a difference between unmounting directly by providing a mount id
> > and getting an O_PATH file descriptor from a mnt-id. If you can simply
> > unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH
> > in a container. Without it you likely need to require
> > capable(CAP_DAC_READ_SEARCH) aka system level privileges just like
> > open_to_handle_at() which makes this interface way less generic and
> > usable. Otherwise you'd be able to get an O_PATH fd to something that
> > you wouldn't be able to access through normal path lookup.
>
>
> Being able to convert into an O_PATH descriptor gives you more options
> than just unmounting. It should allow you to syncfs() before
> unmounting. It should allow you to call open_tree() so you can
> manipulate the filesystem that is no longer accessible by path walk
> (e.g. so you can bind it elsewhere or move it).

I'm not saying it's wrong. I'm just saying there are trade-offs. Sure
this is useful but it'll need to be a pretty privileged api which might
be fine.

2023-04-14 16:44:50

by Trond Myklebust

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk



> On Apr 14, 2023, at 12:22, Al Viro <[email protected]> wrote:
>
> On Fri, Apr 14, 2023 at 03:57:34PM +0000, Trond Myklebust wrote:
>
>>>
>>> Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it).
>>>
>>
>> One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever.
>>
>
> s/lazy/forced/, surely? Confused...

I mean both cases. Doing a lazy umount with a hard mounted filesystem is a risk sport: if the server does become permanently borked, you can fill up your page cache with stuff that can’t be evicted. Most users don’t realise this, so they get confused when it happens (particularly since the filesystem is out-of-sight and hence out-of-mind).

>
> Note, BTW, that hard vs. soft is a property of fs instance; if you have
> it present elsewhere in the mount tree, flipping it would affect all
> such places. I don't see any good way to make it a per-mount thing, TBH…


The main use case is for when the server is permanently down, so normally it shouldn’t be a problem with flipping the mode on all instances.

That said, it might be nice to make it per-mountpoint at some time. We do have the ability to declare individual RPC calls to time out, so it’s doable at the RPC level. All we would really need is the ability to store a per-vfsmount flag.

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2023-04-14 19:09:58

by Benjamin Coddington

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On 14 Apr 2023, at 12:41, Trond Myklebust wrote:
>
> I mean both cases. Doing a lazy umount with a hard mounted filesystem is a risk sport: if the server does become permanently borked, you can fill up your page cache with stuff that can’t be evicted. Most users don’t realise this, so they get confused when it happens (particularly since the filesystem is out-of-sight and hence out-of-mind).

I've been pecking away at a sysfs knob for this case. Seemed a clearer path to destruction.

>>
>> Note, BTW, that hard vs. soft is a property of fs instance; if you have
>> it present elsewhere in the mount tree, flipping it would affect all
>> such places. I don't see any good way to make it a per-mount thing, TBH…
>
>
> The main use case is for when the server is permanently down, so normally it shouldn’t be a problem with flipping the mode on all instances.

Is there another case? Because, if so..

> That said, it might be nice to make it per-mountpoint at some time. We do have the ability to declare individual RPC calls to time out, so it’s doable at the RPC level. All we would really need is the ability to store a per-vfsmount flag.

.. maybe vfsmount's mnt_root d_fsdata?

Ben

2023-04-15 09:58:50

by Amir Goldstein

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 1:13 PM Jeff Layton <[email protected]> wrote:
>
> On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote:
> > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote:
> > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote:
> > >
> > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side?
> > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor.
> > >
> > > You can already do umount -l /proc/self/fd/69 if you have a descriptor.
> >
> > Way back when we put together stuff for [2] we had umountat() as an item
> > but decided against it because it's mostely useful when used as AT_EMPTY_PATH.
> >
> > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the
> > path and you need to resolve it with lookup restrictions. Then path
> > resolution restrictions of openat2() can be used to get an fd. Which can
> > be passed to umount().
> >
> > I need to step outside so this is a halfway-out-the-door thought but
> > given your description of the problem Jeff, why doesn't the following
> > work (Just sketching this, you can't call openat2() like that.):
> >
> > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED)
> > umount("/proc/self/fd/fd_mnt", MNT_DETACH)
>
> Something like that might work. A RESOLVE_CACHED flag is something that
> would involve more than just umount(2) though. That said, it could be
> useful in other situations.
>
> >
> > > Converting mount-id to O_PATH... might be an interesting idea.
> >
> > I think using mount-ids would be nice and fwiw, something we considered
> > as an alternative to umountat(). Not just can they be gotten from
> > /proc/<pid>/mountinfo but we also do expose the mount id to userspace
> > nowadays through:
> >
> > STATX_MNT_ID
> > __u64 stx_mnt_id;
> >
> > which also came out of [2]. And it should be safe to do via
> > AT_STATX_DONT_SYNC:
> >
> > statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC)
> >
> > using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then
> > pass that mount-id to the new system call.
> >
> > [2]: https://github.com/uapi-group/kernel-features
>
> This is generating a lot of good ideas! Maybe we should plan to discuss
> this further at LSF/MM?
>

Hi Jeff,

I am trying to collect the topics for LSF/MM FS sessions, but it is somewhat
hard to do without an official [TOPIC] suggestion.

Not sure if this specific thread has anything left to discuss in a session
or if the original SUBJECT still describes the wider topic accurately.

Could you please follow up with a [TOPIC] proposal or just let me know
1. That you are interested to lead the session
2. Descriptive title for the session to put in the schedule
3. lore link to put in the schedule

While at it, please provide me with this info regarding
"i_version improvements".

Thanks,
Amir.

2023-04-16 23:38:35

by NeilBrown

[permalink] [raw]
Subject: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.


When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
engage with underlying systems at all. Any mount point MUST be in the
dcache with a complete direct path from the root to the mountpoint.
That should be sufficient to find the mountpoint given a path name.

This becomes an issue when the filesystem changes unexpected, such as
when a NFS server is changed to reject all access. It then becomes
impossible to unmount anything mounted on the filesystem which has
changed. We could simply lazy-unmount the changed filesystem and that
will often be sufficient. However if the target filesystem needs
"umount -f" to complete the unmount properly, then the lazy unmount will
leave it incompletely unmounted. When "-f" is needed, we really need to
be able to name the target filesystem.

We NEVER want to revalidate anything. We already avoid the revalidation
of the mountpoint itself, be we won't need to revalidate anything on the
path either as thay might affect the cache, and the cache is what we are
really looking in.

Permission checks are a little less clear. We currently allow any user
to make the umount syscall and perform the path lookup and only reject
unprivileged users once the target mount point has been found. If we
completely relax permission checks then an unprivileged user could probe
inaccessible parts of the name space by examining the error returned
from umount().

So we only relax permission check when the user has CAP_SYS_ADMIN
(may_mount() succeeds).

Note that if the path given is not direct and for example uses symlinks
or "..", then dentries or symlink content may not be cached and a remote
server could cause problem. We can only be certain of a safe unmount if
a direct path is used.

Signed-off-by: NeilBrown <[email protected]>
---
fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index edfedfbccaef..f2df1adae7c5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
*
* When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
*/
-int inode_permission(struct mnt_idmap *idmap,
- struct inode *inode, int mask)
+int inode_permission_mp(struct mnt_idmap *idmap,
+ struct inode *inode, int mask, bool mp)
{
int retval;

@@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
return -EACCES;
}

- retval = do_inode_permission(idmap, inode, mask);
+ if (mp)
+ /* We are looking for a mountpoint and so don't
+ * want to interact with the filesystem at all, just
+ * the dcache and icache.
+ */
+ retval = generic_permission(idmap, inode, mask);
+ else
+ retval = do_inode_permission(idmap, inode, mask);
if (retval)
return retval;

@@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,

return security_inode_permission(inode, mask);
}
+
+/**
+ * inode_permission - Check for access rights to a given inode
+ * @idmap: idmap of the mount the inode was found from
+ * @inode: Inode to check permission on
+ * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Check for read/write/execute permissions on an inode. We use fs[ug]id for
+ * this, letting us set arbitrary permissions for filesystem access without
+ * changing the "normal" UIDs which are used for other things.
+ *
+ * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
+ */
+int inode_permission(struct mnt_idmap *idmap,
+ struct inode *inode, int mask)
+{
+ return inode_permission_mp(idmap, inode, mask, false);
+}
EXPORT_SYMBOL(inode_permission);

/**
@@ -589,6 +614,7 @@ struct nameidata {
#define ND_ROOT_PRESET 1
#define ND_ROOT_GRABBED 2
#define ND_JUMPED 4
+#define ND_SYS_ADMIN 8

static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
{
@@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)

static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
{
- if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
+ if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
+ likely(!(flags & LOOKUP_MOUNTPOINT)))
return dentry->d_op->d_revalidate(dentry, flags);
else
return 1;
@@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
static inline int may_lookup(struct mnt_idmap *idmap,
struct nameidata *nd)
{
+ /* If we are looking for a mountpoint and we have the SYS_ADMIN
+ * capability, then we will by-pass the filesys for permission checks
+ * and just use generic_permission().
+ */
+ bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
if (nd->flags & LOOKUP_RCU) {
- int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
+ int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
if (err != -ECHILD || !try_to_unlazy(nd))
return err;
}
- return inode_permission(idmap, nd->inode, MAY_EXEC);
+ return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
}

static int reserve_stack(struct nameidata *nd, struct path *link)
@@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
if (IS_ERR(name))
return PTR_ERR(name);
set_nameidata(&nd, dfd, name, root);
+ if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
+ nd.state = ND_SYS_ADMIN;
retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
if (unlikely(retval == -ECHILD))
retval = path_lookupat(&nd, flags, path);
--
2.40.0

2023-04-17 08:23:44

by Christian Brauner

[permalink] [raw]
Subject: Re: allowing for a completely cached umount(2) pathwalk

On Fri, Apr 14, 2023 at 03:01:01PM -0400, Benjamin Coddington wrote:
> On 14 Apr 2023, at 12:41, Trond Myklebust wrote:
> >
> > I mean both cases. Doing a lazy umount with a hard mounted filesystem is a risk sport: if the server does become permanently borked, you can fill up your page cache with stuff that can’t be evicted. Most users don’t realise this, so they get confused when it happens (particularly since the filesystem is out-of-sight and hence out-of-mind).
>
> I've been pecking away at a sysfs knob for this case. Seemed a clearer path to destruction.
>
> >>
> >> Note, BTW, that hard vs. soft is a property of fs instance; if you have
> >> it present elsewhere in the mount tree, flipping it would affect all
> >> such places. I don't see any good way to make it a per-mount thing, TBH…
> >
> >
> > The main use case is for when the server is permanently down, so normally it shouldn’t be a problem with flipping the mode on all instances.
>
> Is there another case? Because, if so..
>
> > That said, it might be nice to make it per-mountpoint at some time.
> > We do have the ability to declare individual RPC calls to time out,
> > so it’s doable at the RPC level. All we would really need is the
> > ability to store a per-vfsmount flag.

I would very much like to avoid having filesystem specific data in
struct vfsmount. That sounds like a maintenance nightmare going forward.
Mount structures should remain pure vfs constructs that only carry
generic properties imho.

>
> .. maybe vfsmount's mnt_root d_fsdata?

I don't think that would work without having access to the vfsmount.

2023-04-17 12:09:15

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
>
> When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> engage with underlying systems at all. Any mount point MUST be in the
> dcache with a complete direct path from the root to the mountpoint.
> That should be sufficient to find the mountpoint given a path name.
>
> This becomes an issue when the filesystem changes unexpected, such as
> when a NFS server is changed to reject all access. It then becomes
> impossible to unmount anything mounted on the filesystem which has
> changed. We could simply lazy-unmount the changed filesystem and that
> will often be sufficient. However if the target filesystem needs
> "umount -f" to complete the unmount properly, then the lazy unmount will
> leave it incompletely unmounted. When "-f" is needed, we really need to

I don't understand this yet. All I see is nfs_umount_begin() that's
different for MNT_FORCE to kill remaining io. Why does that preclude
MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
get rid is to use MNT_DETACH, no? So I don't see why that is an
argument.

> be able to name the target filesystem.
>
> We NEVER want to revalidate anything. We already avoid the revalidation
> of the mountpoint itself, be we won't need to revalidate anything on the
> path either as thay might affect the cache, and the cache is what we are
> really looking in.
>
> Permission checks are a little less clear. We currently allow any user

This is all very brittle.

First case that comes to mind is overlayfs where the permission checking
is performed twice. Once on the overlayfs inode itself based on the
caller's security context and a second time on the underlying inode with
the security context of the mounter of the overlayfs instance.

A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
they'd be able to mount the overlayfs instance but would be restricted
from accessing certain directories or files. The task accessing the
overlayfs instance however could have a completely different security
context including CAP_DAC_READ_SEARCH and such. Both tasks could
reasonably be in different user namespaces and so on.

The LSM hooks are also called twice and would now also be called once.

It also forgets that acl_permission() check may very well call into the
filesystem via check_acl().

So umount could either be used to infer existence of files that the user
wouldn't otherwise know they exist or in the worst case allow to umount
something that they wouldn't have access to.

Aside from that this would break userspace assumptions and as Al and
I've mentioned before in the other thread you'd need a new flag to
umount2() for this. The permission model can't just change behind users
back.

But I dislike it for the now even more special-cased umount path lookup
alone tbh. I'd feel way more comfortable with a non-lookup related
solution that doesn't have subtle implications for permission checking.

> to make the umount syscall and perform the path lookup and only reject
> unprivileged users once the target mount point has been found. If we
> completely relax permission checks then an unprivileged user could probe
> inaccessible parts of the name space by examining the error returned
> from umount().
>
> So we only relax permission check when the user has CAP_SYS_ADMIN
> (may_mount() succeeds).
>
> Note that if the path given is not direct and for example uses symlinks
> or "..", then dentries or symlink content may not be cached and a remote
> server could cause problem. We can only be certain of a safe unmount if
> a direct path is used.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index edfedfbccaef..f2df1adae7c5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> *
> * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> */
> -int inode_permission(struct mnt_idmap *idmap,
> - struct inode *inode, int mask)
> +int inode_permission_mp(struct mnt_idmap *idmap,
> + struct inode *inode, int mask, bool mp)
> {
> int retval;
>
> @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> return -EACCES;
> }
>
> - retval = do_inode_permission(idmap, inode, mask);
> + if (mp)
> + /* We are looking for a mountpoint and so don't
> + * want to interact with the filesystem at all, just
> + * the dcache and icache.
> + */
> + retval = generic_permission(idmap, inode, mask);
> + else
> + retval = do_inode_permission(idmap, inode, mask);
> if (retval)
> return retval;
>
> @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
>
> return security_inode_permission(inode, mask);
> }
> +
> +/**
> + * inode_permission - Check for access rights to a given inode
> + * @idmap: idmap of the mount the inode was found from
> + * @inode: Inode to check permission on
> + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> + *
> + * Check for read/write/execute permissions on an inode. We use fs[ug]id for
> + * this, letting us set arbitrary permissions for filesystem access without
> + * changing the "normal" UIDs which are used for other things.
> + *
> + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> + */
> +int inode_permission(struct mnt_idmap *idmap,
> + struct inode *inode, int mask)
> +{
> + return inode_permission_mp(idmap, inode, mask, false);
> +}
> EXPORT_SYMBOL(inode_permission);
>
> /**
> @@ -589,6 +614,7 @@ struct nameidata {
> #define ND_ROOT_PRESET 1
> #define ND_ROOT_GRABBED 2
> #define ND_JUMPED 4
> +#define ND_SYS_ADMIN 8
>
> static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> {
> @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
>
> static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> {
> - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> + likely(!(flags & LOOKUP_MOUNTPOINT)))
> return dentry->d_op->d_revalidate(dentry, flags);
> else
> return 1;
> @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> static inline int may_lookup(struct mnt_idmap *idmap,
> struct nameidata *nd)
> {
> + /* If we are looking for a mountpoint and we have the SYS_ADMIN
> + * capability, then we will by-pass the filesys for permission checks
> + * and just use generic_permission().
> + */
> + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> if (nd->flags & LOOKUP_RCU) {
> - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> if (err != -ECHILD || !try_to_unlazy(nd))
> return err;
> }
> - return inode_permission(idmap, nd->inode, MAY_EXEC);
> + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> }
>
> static int reserve_stack(struct nameidata *nd, struct path *link)
> @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> if (IS_ERR(name))
> return PTR_ERR(name);
> set_nameidata(&nd, dfd, name, root);
> + if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> + nd.state = ND_SYS_ADMIN;
> retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> if (unlikely(retval == -ECHILD))
> retval = path_lookupat(&nd, flags, path);
> --
> 2.40.0
>

2023-04-17 12:13:10

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Mon, 2023-04-17 at 09:13 +1000, NeilBrown wrote:
> When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> engage with underlying systems at all. Any mount point MUST be in the
> dcache with a complete direct path from the root to the mountpoint.
> That should be sufficient to find the mountpoint given a path name.
>
> This becomes an issue when the filesystem changes unexpected, such as
> when a NFS server is changed to reject all access. It then becomes
> impossible to unmount anything mounted on the filesystem which has
> changed. We could simply lazy-unmount the changed filesystem and that
> will often be sufficient. However if the target filesystem needs
> "umount -f" to complete the unmount properly, then the lazy unmount will
> leave it incompletely unmounted. When "-f" is needed, we really need to
> be able to name the target filesyste
>
> We NEVER want to revalidate anything. We already avoid the revalidation
> of the mountpoint itself, be we won't need to revalidate anything on the
> path either as thay might affect the cache, and the cache is what we are
> really looking in.
>
> Permission checks are a little less clear. We currently allow any user
> to make the umount syscall and perform the path lookup and only reject
> unprivileged users once the target mount point has been found. If we
> completely relax permission checks then an unprivileged user could probe
> inaccessible parts of the name space by examining the error returned
> from umount().
>

That sounds pretty reasonable. Most umounts are done by root in some
fashion anyway.



> So we only relax permission check when the user has CAP_SYS_ADMIN
> (may_mount() succeeds).
>
> Note that if the path given is not direct and for example uses symlinks
> or "..", then dentries or symlink content may not be cached and a remote
> server could cause problem. We can only be certain of a safe unmount if
> a direct path is used.
>

I guess we do still have to allow it to do a lookup due to symlinks. I
think this is still worthwhile though since it'd fix a lot of these
cases.

> Signed-off-by: NeilBrown <[email protected]>
> ---
> fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index edfedfbccaef..f2df1adae7c5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> *
> * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> */
> -int inode_permission(struct mnt_idmap *idmap,
> - struct inode *inode, int mask)
> +int inode_permission_mp(struct mnt_idmap *idmap,
> + struct inode *inode, int mask, bool mp)
> {
> int retval;
>
> @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> return -EACCES;
> }
>
> - retval = do_inode_permission(idmap, inode, mask);
> + if (mp)
> + /* We are looking for a mountpoint and so don't
> + * want to interact with the filesystem at all, just
> + * the dcache and icache.
> + */
> + retval = generic_permission(idmap, inode, mask);
> + else
> + retval = do_inode_permission(idmap, inode, mask);
> if (retval)
> return retval;
>
> @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
>
> return security_inode_permission(inode, mask);
> }
> +
> +/**
> + * inode_permission - Check for access rights to a given inode
> + * @idmap: idmap of the mount the inode was found from
> + * @inode: Inode to check permission on
> + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> + *
> + * Check for read/write/execute permissions on an inode. We use fs[ug]id for
> + * this, letting us set arbitrary permissions for filesystem access without
> + * changing the "normal" UIDs which are used for other things.
> + *
> + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> + */
> +int inode_permission(struct mnt_idmap *idmap,
> + struct inode *inode, int mask)
> +{
> + return inode_permission_mp(idmap, inode, mask, false);
> +}
> EXPORT_SYMBOL(inode_permission);
>
> /**
> @@ -589,6 +614,7 @@ struct nameidata {
> #define ND_ROOT_PRESET 1
> #define ND_ROOT_GRABBED 2
> #define ND_JUMPED 4
> +#define ND_SYS_ADMIN 8
>
> static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> {
> @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
>
> static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> {
> - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> + likely(!(flags & LOOKUP_MOUNTPOINT)))
> return dentry->d_op->d_revalidate(dentry, flags);
> else
> return 1;
> @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> static inline int may_lookup(struct mnt_idmap *idmap,
> struct nameidata *nd)
> {
> + /* If we are looking for a mountpoint and we have the SYS_ADMIN
> + * capability, then we will by-pass the filesys for permission checks
> + * and just use generic_permission().
> + */
> + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> if (nd->flags & LOOKUP_RCU) {
> - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> if (err != -ECHILD || !try_to_unlazy(nd))
> return err;
> }
> - return inode_permission(idmap, nd->inode, MAY_EXEC);
> + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> }
>
> static int reserve_stack(struct nameidata *nd, struct path *link)
> @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> if (IS_ERR(name))
> return PTR_ERR(name);
> set_nameidata(&nd, dfd, name, root);
> + if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> + nd.state = ND_SYS_ADMIN;
> retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> if (unlikely(retval == -ECHILD))
> retval = path_lookupat(&nd, flags, path);

This behavior looks right along the lines of what I was thinking.

Just for bisectability, it might be worthwhile to break this up along
conceptual lines: one patch to make it skip d_revalidate, one that
changes the permission checking, etc.

I'll plan to give this a try soon with Dave's reproducer.

Thanks!
--
Jeff Layton <[email protected]>

2023-04-17 12:35:39

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> >
> > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > engage with underlying systems at all. Any mount point MUST be in the
> > dcache with a complete direct path from the root to the mountpoint.
> > That should be sufficient to find the mountpoint given a path name.
> >
> > This becomes an issue when the filesystem changes unexpected, such as
> > when a NFS server is changed to reject all access. It then becomes
> > impossible to unmount anything mounted on the filesystem which has
> > changed. We could simply lazy-unmount the changed filesystem and that
> > will often be sufficient. However if the target filesystem needs
> > "umount -f" to complete the unmount properly, then the lazy unmount will
> > leave it incompletely unmounted. When "-f" is needed, we really need to
>
> I don't understand this yet. All I see is nfs_umount_begin() that's
> different for MNT_FORCE to kill remaining io. Why does that preclude
> MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> get rid is to use MNT_DETACH, no? So I don't see why that is an
> argument.
>
> > be able to name the target filesystem.
> >
> > We NEVER want to revalidate anything. We already avoid the revalidation
> > of the mountpoint itself, be we won't need to revalidate anything on the
> > path either as thay might affect the cache, and the cache is what we are
> > really looking in.
> >
> > Permission checks are a little less clear. We currently allow any user
>
> This is all very brittle.
>
> First case that comes to mind is overlayfs where the permission checking
> is performed twice. Once on the overlayfs inode itself based on the
> caller's security context and a second time on the underlying inode with
> the security context of the mounter of the overlayfs instance.
>
> A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> they'd be able to mount the overlayfs instance but would be restricted
> from accessing certain directories or files. The task accessing the
> overlayfs instance however could have a completely different security
> context including CAP_DAC_READ_SEARCH and such. Both tasks could
> reasonably be in different user namespaces and so on.
>
> The LSM hooks are also called twice and would now also be called once.
>
> It also forgets that acl_permission() check may very well call into the
> filesystem via check_acl().
>
> So umount could either be used to infer existence of files that the user
> wouldn't otherwise know they exist or in the worst case allow to umount
> something that they wouldn't have access to.
>
> Aside from that this would break userspace assumptions and as Al and
> I've mentioned before in the other thread you'd need a new flag to
> umount2() for this. The permission model can't just change behind users
> back.
>
> But I dislike it for the now even more special-cased umount path lookup
> alone tbh. I'd feel way more comfortable with a non-lookup related
> solution that doesn't have subtle implications for permission checking.
>

These are good points.

One way around the issues you point out might be to pass down a new
MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the
filesystem driver to decide whether it wants to avoid potentially
problematic activity when checking permissions. nfs_permission could
check for that and take a more hands-off approach to the permissions
check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I
think that might do what we need.

> > to make the umount syscall and perform the path lookup and only reject
> > unprivileged users once the target mount point has been found. If we
> > completely relax permission checks then an unprivileged user could probe
> > inaccessible parts of the name space by examining the error returned
> > from umount().
> >
> > So we only relax permission check when the user has CAP_SYS_ADMIN
> > (may_mount() succeeds).
> >
> > Note that if the path given is not direct and for example uses symlinks
> > or "..", then dentries or symlink content may not be cached and a remote
> > server could cause problem. We can only be certain of a safe unmount if
> > a direct path is used.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 40 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index edfedfbccaef..f2df1adae7c5 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> > *
> > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > */
> > -int inode_permission(struct mnt_idmap *idmap,
> > - struct inode *inode, int mask)
> > +int inode_permission_mp(struct mnt_idmap *idmap,
> > + struct inode *inode, int mask, bool mp)
> > {
> > int retval;
> >
> > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> > return -EACCES;
> > }
> >
> > - retval = do_inode_permission(idmap, inode, mask);
> > + if (mp)
> > + /* We are looking for a mountpoint and so don't
> > + * want to interact with the filesystem at all, just
> > + * the dcache and icache.
> > + */
> > + retval = generic_permission(idmap, inode, mask);
> > + else
> > + retval = do_inode_permission(idmap, inode, mask);
> > if (retval)
> > return retval;
> >
> > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
> >
> > return security_inode_permission(inode, mask);
> > }
> > +
> > +/**
> > + * inode_permission - Check for access rights to a given inode
> > + * @idmap: idmap of the mount the inode was found from
> > + * @inode: Inode to check permission on
> > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> > + *
> > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for
> > + * this, letting us set arbitrary permissions for filesystem access without
> > + * changing the "normal" UIDs which are used for other things.
> > + *
> > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > + */
> > +int inode_permission(struct mnt_idmap *idmap,
> > + struct inode *inode, int mask)
> > +{
> > + return inode_permission_mp(idmap, inode, mask, false);
> > +}
> > EXPORT_SYMBOL(inode_permission);
> >
> > /**
> > @@ -589,6 +614,7 @@ struct nameidata {
> > #define ND_ROOT_PRESET 1
> > #define ND_ROOT_GRABBED 2
> > #define ND_JUMPED 4
> > +#define ND_SYS_ADMIN 8
> >
> > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> > {
> > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
> >
> > static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> > {
> > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> > + likely(!(flags & LOOKUP_MOUNTPOINT)))
> > return dentry->d_op->d_revalidate(dentry, flags);
> > else
> > return 1;
> > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> > static inline int may_lookup(struct mnt_idmap *idmap,
> > struct nameidata *nd)
> > {
> > + /* If we are looking for a mountpoint and we have the SYS_ADMIN
> > + * capability, then we will by-pass the filesys for permission checks
> > + * and just use generic_permission().
> > + */
> > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> > if (nd->flags & LOOKUP_RCU) {
> > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> > if (err != -ECHILD || !try_to_unlazy(nd))
> > return err;
> > }
> > - return inode_permission(idmap, nd->inode, MAY_EXEC);
> > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> > }
> >
> > static int reserve_stack(struct nameidata *nd, struct path *link)
> > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> > if (IS_ERR(name))
> > return PTR_ERR(name);
> > set_nameidata(&nd, dfd, name, root);
> > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> > + nd.state = ND_SYS_ADMIN;
> > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> > if (unlikely(retval == -ECHILD))
> > retval = path_lookupat(&nd, flags, path);
> > --
> > 2.40.0
> >

--
Jeff Layton <[email protected]>

2023-04-17 14:39:14

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote:
> On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote:
> > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > >
> > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > > engage with underlying systems at all. Any mount point MUST be in the
> > > dcache with a complete direct path from the root to the mountpoint.
> > > That should be sufficient to find the mountpoint given a path name.
> > >
> > > This becomes an issue when the filesystem changes unexpected, such as
> > > when a NFS server is changed to reject all access. It then becomes
> > > impossible to unmount anything mounted on the filesystem which has
> > > changed. We could simply lazy-unmount the changed filesystem and that
> > > will often be sufficient. However if the target filesystem needs
> > > "umount -f" to complete the unmount properly, then the lazy unmount will
> > > leave it incompletely unmounted. When "-f" is needed, we really need to
> >
> > I don't understand this yet. All I see is nfs_umount_begin() that's
> > different for MNT_FORCE to kill remaining io. Why does that preclude
> > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> > get rid is to use MNT_DETACH, no? So I don't see why that is an
> > argument.
> >
> > > be able to name the target filesystem.
> > >
> > > We NEVER want to revalidate anything. We already avoid the revalidation
> > > of the mountpoint itself, be we won't need to revalidate anything on the
> > > path either as thay might affect the cache, and the cache is what we are
> > > really looking in.
> > >
> > > Permission checks are a little less clear. We currently allow any user
> >
> > This is all very brittle.
> >
> > First case that comes to mind is overlayfs where the permission checking
> > is performed twice. Once on the overlayfs inode itself based on the
> > caller's security context and a second time on the underlying inode with
> > the security context of the mounter of the overlayfs instance.
> >
> > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> > they'd be able to mount the overlayfs instance but would be restricted
> > from accessing certain directories or files. The task accessing the
> > overlayfs instance however could have a completely different security
> > context including CAP_DAC_READ_SEARCH and such. Both tasks could
> > reasonably be in different user namespaces and so on.
> >
> > The LSM hooks are also called twice and would now also be called once.
> >
> > It also forgets that acl_permission() check may very well call into the
> > filesystem via check_acl().
> >
> > So umount could either be used to infer existence of files that the user
> > wouldn't otherwise know they exist or in the worst case allow to umount
> > something that they wouldn't have access to.
> >
> > Aside from that this would break userspace assumptions and as Al and
> > I've mentioned before in the other thread you'd need a new flag to
> > umount2() for this. The permission model can't just change behind users
> > back.
> >
> > But I dislike it for the now even more special-cased umount path lookup
> > alone tbh. I'd feel way more comfortable with a non-lookup related
> > solution that doesn't have subtle implications for permission checking.
> >
>
> These are good points.
>
> One way around the issues you point out might be to pass down a new
> MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the
> filesystem driver to decide whether it wants to avoid potentially
> problematic activity when checking permissions. nfs_permission could
> check for that and take a more hands-off approach to the permissions
> check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I
> think that might do what we need.

Yes, that's pretty obvious. I considered that, wrote the section and
deleted it again because I still find this pretty ugly. It does leak
very specific lookup information into filesystems that isn't generically
useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with
a server like NFS does and so don't suffer from this issue.

The crucial change in the patchset in its current form is that you're
requesting from the VFS to significantly alter permission checking just
because this is a umount request in a pretty fundamental way for roughly
21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS
can't just skip a filesystem's ->permission() handler with "well, it's
just on a way to a umount so whatever". That's just not going to be
correct and is just a source of subtle security bugs. So NAK on that.

And I'm curious why is it obvious that we don't want to revalidate _any_
path component and not just the last one? Why is that generally safe?
Why can't this be used to access files and directories the caller
wouldn't otherwise be able to access? I would like to have this spelled
out for slow people like me, please.

From my point of view, this would only be somewhat safe _generally_ if
you'd allow circumvention for revalidation and permission checking if
MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
You'd still mess with overlayfs permission model in this case though.

Plus, there are better options of solving this problem. Again, I'd
rather build a separate api for unmounting then playing such potentially
subtle security sensitive games with permission checking during path
lookup.

2023-04-17 15:23:53

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote:
> > On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote:
> > > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> > > >
> > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > > > engage with underlying systems at all. Any mount point MUST be in the
> > > > dcache with a complete direct path from the root to the mountpoint.
> > > > That should be sufficient to find the mountpoint given a path name.
> > > >
> > > > This becomes an issue when the filesystem changes unexpected, such as
> > > > when a NFS server is changed to reject all access. It then becomes
> > > > impossible to unmount anything mounted on the filesystem which has
> > > > changed. We could simply lazy-unmount the changed filesystem and that
> > > > will often be sufficient. However if the target filesystem needs
> > > > "umount -f" to complete the unmount properly, then the lazy unmount will
> > > > leave it incompletely unmounted. When "-f" is needed, we really need to
> > >
> > > I don't understand this yet. All I see is nfs_umount_begin() that's
> > > different for MNT_FORCE to kill remaining io. Why does that preclude
> > > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> > > get rid is to use MNT_DETACH, no? So I don't see why that is an
> > > argument.
> > >
> > > > be able to name the target filesystem.
> > > >
> > > > We NEVER want to revalidate anything. We already avoid the revalidation
> > > > of the mountpoint itself, be we won't need to revalidate anything on the
> > > > path either as thay might affect the cache, and the cache is what we are
> > > > really looking in.
> > > >
> > > > Permission checks are a little less clear. We currently allow any user
> > >
> > > This is all very brittle.
> > >
> > > First case that comes to mind is overlayfs where the permission checking
> > > is performed twice. Once on the overlayfs inode itself based on the
> > > caller's security context and a second time on the underlying inode with
> > > the security context of the mounter of the overlayfs instance.
> > >
> > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> > > they'd be able to mount the overlayfs instance but would be restricted
> > > from accessing certain directories or files. The task accessing the
> > > overlayfs instance however could have a completely different security
> > > context including CAP_DAC_READ_SEARCH and such. Both tasks could
> > > reasonably be in different user namespaces and so on.
> > >
> > > The LSM hooks are also called twice and would now also be called once.
> > >
> > > It also forgets that acl_permission() check may very well call into the
> > > filesystem via check_acl().
> > >
> > > So umount could either be used to infer existence of files that the user
> > > wouldn't otherwise know they exist or in the worst case allow to umount
> > > something that they wouldn't have access to.
> > >
> > > Aside from that this would break userspace assumptions and as Al and
> > > I've mentioned before in the other thread you'd need a new flag to
> > > umount2() for this. The permission model can't just change behind users
> > > back.
> > >
> > > But I dislike it for the now even more special-cased umount path lookup
> > > alone tbh. I'd feel way more comfortable with a non-lookup related
> > > solution that doesn't have subtle implications for permission checking.
> > >
> >
> > These are good points.
> >
> > One way around the issues you point out might be to pass down a new
> > MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the
> > filesystem driver to decide whether it wants to avoid potentially
> > problematic activity when checking permissions. nfs_permission could
> > check for that and take a more hands-off approach to the permissions
> > check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I
> > think that might do what we need.
>
> Yes, that's pretty obvious. I considered that, wrote the section and
> deleted it again because I still find this pretty ugly. It does leak
> very specific lookup information into filesystems that isn't generically
> useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with
> a server like NFS does and so don't suffer from this issue.
>

Sort of. Most of the MAY flags cover a specific operation. In this case,
we'd just be adding a flag to make it clear that this permission check
is for the specific purpose of chasing down a mountpoint (presumably to
umount).

This field does look less like a "mask" field now though, and more like
a "flags".

> The crucial change in the patchset in its current form is that you're
> requesting from the VFS to significantly alter permission checking just
> because this is a umount request in a pretty fundamental way for roughly
> 21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS
> can't just skip a filesystem's ->permission() handler with "well, it's
> just on a way to a umount so whatever". That's just not going to be
> correct and is just a source of subtle security bugs. So NAK on that.
>

Fair enough. I too think the permission check ought to be left up to the
filesystem driver. It does need some way to know that the permission
check is in the context of a LOOKUP_MOUNTPOINT pathwalk though. A MAY_*
flag seems like the obvious choice, since we could use that for checking
ACLs too, but maybe there is some better way.

> And I'm curious why is it obvious that we don't want to revalidate _any_
> path component and not just the last one? Why is that generally safe?
> Why can't this be used to access files and directories the caller
> wouldn't otherwise be able to access? I would like to have this spelled
> out for slow people like me, please.
>
> From my point of view, this would only be somewhat safe _generally_ if
> you'd allow circumvention for revalidation and permission checking if
> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> You'd still mess with overlayfs permission model in this case though.
>
> Plus, there are better options of solving this problem. Again, I'd
> rather build a separate api for unmounting then playing such potentially
> subtle security sensitive games with permission checking during path
> lookup.

umount(2) is really a special case because the whole intent is to detach
a mount from the local hierarchy and stop using it. The unfortunate bit
is that it is a path-based syscall.

So usually we have to:

- determine the path: Maybe stat() it and to validate that it's the
mountpoint we want to drop
- then call umount with that path

The last thing we want in that case is for the server to decide to
change some intermediate dentry in between the two operations. Best
case, you'll get back ENOENT or something when the pathwalk fails. Worst
case, the server swaps what are two different mountpoints on your client
and you unmount the wrong one.

If we don't revaliate, then we're no worse off, and may be better off if
something hinky happens to the server of an intermediate dentry in the
path.
--
Jeff Layton <[email protected]>

2023-04-17 21:33:53

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Mon, 17 Apr 2023, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote:
> >
> > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to
> > engage with underlying systems at all. Any mount point MUST be in the
> > dcache with a complete direct path from the root to the mountpoint.
> > That should be sufficient to find the mountpoint given a path name.
> >
> > This becomes an issue when the filesystem changes unexpected, such as
> > when a NFS server is changed to reject all access. It then becomes
> > impossible to unmount anything mounted on the filesystem which has
> > changed. We could simply lazy-unmount the changed filesystem and that
> > will often be sufficient. However if the target filesystem needs
> > "umount -f" to complete the unmount properly, then the lazy unmount will
> > leave it incompletely unmounted. When "-f" is needed, we really need to
>
> I don't understand this yet. All I see is nfs_umount_begin() that's
> different for MNT_FORCE to kill remaining io. Why does that preclude
> MNT_DETACH? You might very well fail MNT_FORCE and the only way you can
> get rid is to use MNT_DETACH, no? So I don't see why that is an
> argument.

Possibly I could have been more clear.
If /foo is the mount point for a filesystem that is causing problems and
/foo/bar is the mount point of a different filesystem, which might have
other problems, then I was saying that the simple solution of
umount -l /foo
which would MNT_DETACH both filesystems is not ideal because there might
be pending IO that will never complete, so the mount can never be
cleaned up. In such cases we really want to be able to do
umount -f /foo/bar
and ensure that succeeds.
You can see now that it isn't that MNT_FORCE precludes MNT_DETACH, but
they would be done on different filesystems.

MNT_FORCE is, I think, a good idea and a needed functionality that has
never been implemented well.
MNT_FORCE causes nfs_umount_begin to be called as you noted, which
aborts all pending RPCs on that filesystem.
This might be useful if you have an "ls" running, as it is likely to
abort on the first getdents() failure. But if you have "ls -l' running,
it will likely just go and try another lstat(), and that is just as
likely to hang - thus still preventing the umount.

What we used to do was find anything using the mount and kill it
(sometime "fuser -k" would work). These processes would likely be stuck
in a D wait, but "umount -f" would abort the RPCs and the processes
would get unstuck and could respond to the signal. This would sometimes
take a couple of iterations.

These days we have TASK_KILLABLE so this is much less likely to be
needed - we just kill all the processes and they die promptly. Most of
the time. There are a few places that can still block, but which no-one
has had the motivation to fix.

It would be much nicer if we didn't need to kill things. Even with
"fuser -k" it isn't really the sort of thing you want in a shutdown
script (or in systemd-shutdown).

It would be ideal if the shutdown process could call "umount" and if
that fails with EBUSY, call "umount -f" and be confident of ....
something. Currently "-f" might inspire hope, but not confidence.

We cannot realistically make MNT_FORCE truly force a umount because
processes really need to die before that happens. But we could ensure
that the filesystem is quiescent and stays that way.

Maybe we could add a flag to 'struct vfsmount' to say that "unmount has
been forced". Any attempt to use a fd with such a vfsmnt would fail,
expect close() and anything similar. Maybe nfs_umount_begin could
iterate all open files on that vfsmnt and purge any cached data so no
background writes were needed. I think this would be very close to what
we needed.

Then systemd could run umount() and if that failed then
umount(MNT_FORCE) and if that fails umount(MNT_DETACH), with confidence
that there would be not more RPCs generates, so it would be safe to turn
off the network.

BTW, that is another reason that just doing "umount -l /foo" is not
ideal. We sometimes want to wait for the umount to complete, so we can
remove the backing store or the network. "umount -l /foo" doesn't let
us wait. "umount /foo/bar" does.

>
> > be able to name the target filesystem.
> >
> > We NEVER want to revalidate anything. We already avoid the revalidation
> > of the mountpoint itself, be we won't need to revalidate anything on the
> > path either as thay might affect the cache, and the cache is what we are
> > really looking in.
> >
> > Permission checks are a little less clear. We currently allow any user
>
> This is all very brittle.
>
> First case that comes to mind is overlayfs where the permission checking
> is performed twice. Once on the overlayfs inode itself based on the
> caller's security context and a second time on the underlying inode with
> the security context of the mounter of the overlayfs instance.
>
> A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so
> they'd be able to mount the overlayfs instance but would be restricted
> from accessing certain directories or files. The task accessing the
> overlayfs instance however could have a completely different security
> context including CAP_DAC_READ_SEARCH and such. Both tasks could
> reasonably be in different user namespaces and so on.
>
> The LSM hooks are also called twice and would now also be called once.

I'd prefer "called never"

>
> It also forgets that acl_permission() check may very well call into the
> filesystem via check_acl().

I didn't pay proper attention to acl_permission() - you are right.

>
> So umount could either be used to infer existence of files that the user
> wouldn't otherwise know they exist or in the worst case allow to umount
> something that they wouldn't have access to.
>
> Aside from that this would break userspace assumptions and as Al and
> I've mentioned before in the other thread you'd need a new flag to
> umount2() for this. The permission model can't just change behind users
> back.
>
> But I dislike it for the now even more special-cased umount path lookup
> alone tbh. I'd feel way more comfortable with a non-lookup related
> solution that doesn't have subtle implications for permission checking.

I was really hoping that existing code could be made to just work.

Thanks for the review.

NeilBrown


>
> > to make the umount syscall and perform the path lookup and only reject
> > unprivileged users once the target mount point has been found. If we
> > completely relax permission checks then an unprivileged user could probe
> > inaccessible parts of the name space by examining the error returned
> > from umount().
> >
> > So we only relax permission check when the user has CAP_SYS_ADMIN
> > (may_mount() succeeds).
> >
> > Note that if the path given is not direct and for example uses symlinks
> > or "..", then dentries or symlink content may not be cached and a remote
> > server could cause problem. We can only be certain of a safe unmount if
> > a direct path is used.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 40 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index edfedfbccaef..f2df1adae7c5 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
> > *
> > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > */
> > -int inode_permission(struct mnt_idmap *idmap,
> > - struct inode *inode, int mask)
> > +int inode_permission_mp(struct mnt_idmap *idmap,
> > + struct inode *inode, int mask, bool mp)
> > {
> > int retval;
> >
> > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap,
> > return -EACCES;
> > }
> >
> > - retval = do_inode_permission(idmap, inode, mask);
> > + if (mp)
> > + /* We are looking for a mountpoint and so don't
> > + * want to interact with the filesystem at all, just
> > + * the dcache and icache.
> > + */
> > + retval = generic_permission(idmap, inode, mask);
> > + else
> > + retval = do_inode_permission(idmap, inode, mask);
> > if (retval)
> > return retval;
> >
> > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap,
> >
> > return security_inode_permission(inode, mask);
> > }
> > +
> > +/**
> > + * inode_permission - Check for access rights to a given inode
> > + * @idmap: idmap of the mount the inode was found from
> > + * @inode: Inode to check permission on
> > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
> > + *
> > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for
> > + * this, letting us set arbitrary permissions for filesystem access without
> > + * changing the "normal" UIDs which are used for other things.
> > + *
> > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
> > + */
> > +int inode_permission(struct mnt_idmap *idmap,
> > + struct inode *inode, int mask)
> > +{
> > + return inode_permission_mp(idmap, inode, mask, false);
> > +}
> > EXPORT_SYMBOL(inode_permission);
> >
> > /**
> > @@ -589,6 +614,7 @@ struct nameidata {
> > #define ND_ROOT_PRESET 1
> > #define ND_ROOT_GRABBED 2
> > #define ND_JUMPED 4
> > +#define ND_SYS_ADMIN 8
> >
> > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name)
> > {
> > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
> >
> > static inline int d_revalidate(struct dentry *dentry, unsigned int flags)
> > {
> > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
> > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) &&
> > + likely(!(flags & LOOKUP_MOUNTPOINT)))
> > return dentry->d_op->d_revalidate(dentry, flags);
> > else
> > return 1;
> > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name,
> > static inline int may_lookup(struct mnt_idmap *idmap,
> > struct nameidata *nd)
> > {
> > + /* If we are looking for a mountpoint and we have the SYS_ADMIN
> > + * capability, then we will by-pass the filesys for permission checks
> > + * and just use generic_permission().
> > + */
> > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN);
> > if (nd->flags & LOOKUP_RCU) {
> > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
> > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp);
> > if (err != -ECHILD || !try_to_unlazy(nd))
> > return err;
> > }
> > - return inode_permission(idmap, nd->inode, MAY_EXEC);
> > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp);
> > }
> >
> > static int reserve_stack(struct nameidata *nd, struct path *link)
> > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags,
> > if (IS_ERR(name))
> > return PTR_ERR(name);
> > set_nameidata(&nd, dfd, name, root);
> > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount())
> > + nd.state = ND_SYS_ADMIN;
> > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path);
> > if (unlikely(retval == -ECHILD))
> > retval = path_lookupat(&nd, flags, path);
> > --
> > 2.40.0
> >
>

2023-04-17 21:42:34

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Tue, 18 Apr 2023, Jeff Layton wrote:
>
> The last thing we want in that case is for the server to decide to
> change some intermediate dentry in between the two operations. Best
> case, you'll get back ENOENT or something when the pathwalk fails. Worst
> case, the server swaps what are two different mountpoints on your client
> and you unmount the wrong one.

Actually, the last think I want is for the server to do something that
causes a directory to be invalidated (d_invalidate()). Then any mount
points below there get DETACHed and we lose any change to use MNT_FORCE
or to wait for the unmount to complete. Of course this can also happen
during any other access, not just umount....

>
> If we don't revaliate, then we're no worse off, and may be better off if
> something hinky happens to the server of an intermediate dentry in the
> path.

Exactly. If we don't revalidate we might use old data, but it will be
old data that were were allowed to access. It might be data that we are
not now allowed to access, but it will never be new data that were have
never been allowed to access.

Thanks,
NeilBrown

2023-04-18 03:40:33

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.


> On Apr 17, 2023, at 9:21 AM, Jeff Layton <[email protected]> wrote:
>
> On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
>> And I'm curious why is it obvious that we don't want to revalidate _any_
>> path component and not just the last one? Why is that generally safe?
>> Why can't this be used to access files and directories the caller
>> wouldn't otherwise be able to access? I would like to have this spelled
>> out for slow people like me, please.
>>
>> From my point of view, this would only be somewhat safe _generally_ if
>> you'd allow circumvention for revalidation and permission checking if
>> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
>> You'd still mess with overlayfs permission model in this case though.
>>
>> Plus, there are better options of solving this problem. Again, I'd
>> rather build a separate api for unmounting then playing such potentially
>> subtle security sensitive games with permission checking during path
>> lookup.
>
> umount(2) is really a special case because the whole intent is to detach
> a mount from the local hierarchy and stop using it. The unfortunate bit
> is that it is a path-based syscall.
>
> So usually we have to:
>
> - determine the path: Maybe stat() it and to validate that it's the
> mountpoint we want to drop

The stat() itself may hang because a remote server, or USB stick is
inaccessible or having media errors.

I've just been having a conversation with Karel Zak to change
umount(1) to use statx() so that it interacts minimally with the fs.

In particular, nfs_getattr() skips revalidate if only minimal attrs
are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
locally-cached attrs are still valid (STATX_MODE), so this will
avoid yet one more place that unmount can hang.

In theory, vfs_getattr() could get all of these attributes directly
from the vfs_inode in the unmount case.

> - then call umount with that path
>
> The last thing we want in that case is for the server to decide to
> change some intermediate dentry in between the two operations. Best
> case, you'll get back ENOENT or something when the pathwalk fails. Worst
> case, the server swaps what are two different mountpoints on your client
> and you unmount the wrong one.
>
> If we don't revaliate, then we're no worse off, and may be better off if
> something hinky happens to the server of an intermediate dentry in the
> path.
> --
> Jeff Layton <[email protected]>


Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2023-04-18 08:06:53

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote:
>
> > On Apr 17, 2023, at 9:21 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> >> And I'm curious why is it obvious that we don't want to revalidate _any_
> >> path component and not just the last one? Why is that generally safe?
> >> Why can't this be used to access files and directories the caller
> >> wouldn't otherwise be able to access? I would like to have this spelled
> >> out for slow people like me, please.
> >>
> >> From my point of view, this would only be somewhat safe _generally_ if
> >> you'd allow circumvention for revalidation and permission checking if
> >> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> >> You'd still mess with overlayfs permission model in this case though.
> >>
> >> Plus, there are better options of solving this problem. Again, I'd
> >> rather build a separate api for unmounting then playing such potentially
> >> subtle security sensitive games with permission checking during path
> >> lookup.
> >
> > umount(2) is really a special case because the whole intent is to detach
> > a mount from the local hierarchy and stop using it. The unfortunate bit
> > is that it is a path-based syscall.
> >
> > So usually we have to:
> >
> > - determine the path: Maybe stat() it and to validate that it's the
> > mountpoint we want to drop
>
> The stat() itself may hang because a remote server, or USB stick is
> inaccessible or having media errors.
>
> I've just been having a conversation with Karel Zak to change
> umount(1) to use statx() so that it interacts minimally with the fs.

So we're talking about this commit:
https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f
and the patch we discussed here:
https://github.com/util-linux/util-linux/issues/2049

>
> In particular, nfs_getattr() skips revalidate if only minimal attrs
> are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
> locally-cached attrs are still valid (STATX_MODE), so this will
> avoid yet one more place that unmount can hang.
>
> In theory, vfs_getattr() could get all of these attributes directly
> from the vfs_inode in the unmount case.

We don't really need that. As pointed out in that discussion and as
Karel did we just want to pass AT_STATX_DONT_SYNC.

An api that would allow unmounting by mount id can safely check and
retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID
without ever syncing with the server.

2023-04-18 08:33:46

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Tue, Apr 18, 2023 at 07:34:14AM +1000, NeilBrown wrote:
> On Tue, 18 Apr 2023, Jeff Layton wrote:
> >
> > The last thing we want in that case is for the server to decide to
> > change some intermediate dentry in between the two operations. Best
> > case, you'll get back ENOENT or something when the pathwalk fails. Worst
> > case, the server swaps what are two different mountpoints on your client
> > and you unmount the wrong one.
>
> Actually, the last think I want is for the server to do something that
> causes a directory to be invalidated (d_invalidate()). Then any mount
> points below there get DETACHed and we lose any change to use MNT_FORCE
> or to wait for the unmount to complete. Of course this can also happen
> during any other access, not just umount....

Any rmdir/unlink from another mount namespace where the mountpoint isn't
a local mountpoint would lazily unmount the whole mount tree. You can't
guard against this anyway.

2023-04-20 13:07:58

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote:
> On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote:
> >
> > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <[email protected]> wrote:
> > >
> > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> > > > And I'm curious why is it obvious that we don't want to revalidate _any_
> > > > path component and not just the last one? Why is that generally safe?
> > > > Why can't this be used to access files and directories the caller
> > > > wouldn't otherwise be able to access? I would like to have this spelled
> > > > out for slow people like me, please.
> > > >
> > > > From my point of view, this would only be somewhat safe _generally_ if
> > > > you'd allow circumvention for revalidation and permission checking if
> > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> > > > You'd still mess with overlayfs permission model in this case though.
> > > >
> > > > Plus, there are better options of solving this problem. Again, I'd
> > > > rather build a separate api for unmounting then playing such potentially
> > > > subtle security sensitive games with permission checking during path
> > > > lookup.
> > >
> > > umount(2) is really a special case because the whole intent is to detach
> > > a mount from the local hierarchy and stop using it. The unfortunate bit
> > > is that it is a path-based syscall.
> > >
> > > So usually we have to:
> > >
> > > - determine the path: Maybe stat() it and to validate that it's the
> > > mountpoint we want to drop
> >
> > The stat() itself may hang because a remote server, or USB stick is
> > inaccessible or having media errors.
> >
> > I've just been having a conversation with Karel Zak to change
> > umount(1) to use statx() so that it interacts minimally with the fs.
>
> So we're talking about this commit:
> https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f
> and the patch we discussed here:
> https://github.com/util-linux/util-linux/issues/2049
>
> >
> > In particular, nfs_getattr() skips revalidate if only minimal attrs
> > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
> > locally-cached attrs are still valid (STATX_MODE), so this will
> > avoid yet one more place that unmount can hang.
> >
> > In theory, vfs_getattr() could get all of these attributes directly
> > from the vfs_inode in the unmount case.
>
> We don't really need that. As pointed out in that discussion and as
> Karel did we just want to pass AT_STATX_DONT_SYNC.
>
> An api that would allow unmounting by mount id can safely check and
> retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID
> without ever syncing with the server.

Unfortunately, I don't think that flag trickles down to the ->permission
checks for the pathwalk down to the point you're stat'ing. That turns
out to be a big part of the problem when trying to umount things that
are stuck down in inaccessible trees.

I'm not opposed at all to new APIs that allow for more reliable
unmounting. I think that's a good idea, and something worth hashing out.

But...we're stuck with umount() in perpetuity. Even if you were to merge
a new API for unmounting today, it'll be a decade or more until the
ecosystem catches up. I think we have a responsibility to make the
umount syscall work as well as we can. In this case, that means giving
it as light a footprint as possible on the pathwalk down.

If we need to gate this behavior behind existing or new flags to
umount2(), then so be it, but lets not dismiss this idea in favor of
some new unmounting interface that may never materialize. Improving
umount has the potential to help a lot of our users.

--
Jeff Layton <[email protected]>

2023-04-20 15:43:06

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Thu, Apr 20, 2023 at 09:05:47AM -0400, Jeff Layton wrote:
> On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote:
> > On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote:
> > >
> > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote:
> > > > > And I'm curious why is it obvious that we don't want to revalidate _any_
> > > > > path component and not just the last one? Why is that generally safe?
> > > > > Why can't this be used to access files and directories the caller
> > > > > wouldn't otherwise be able to access? I would like to have this spelled
> > > > > out for slow people like me, please.
> > > > >
> > > > > From my point of view, this would only be somewhat safe _generally_ if
> > > > > you'd allow circumvention for revalidation and permission checking if
> > > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH).
> > > > > You'd still mess with overlayfs permission model in this case though.
> > > > >
> > > > > Plus, there are better options of solving this problem. Again, I'd
> > > > > rather build a separate api for unmounting then playing such potentially
> > > > > subtle security sensitive games with permission checking during path
> > > > > lookup.
> > > >
> > > > umount(2) is really a special case because the whole intent is to detach
> > > > a mount from the local hierarchy and stop using it. The unfortunate bit
> > > > is that it is a path-based syscall.
> > > >
> > > > So usually we have to:
> > > >
> > > > - determine the path: Maybe stat() it and to validate that it's the
> > > > mountpoint we want to drop
> > >
> > > The stat() itself may hang because a remote server, or USB stick is
> > > inaccessible or having media errors.
> > >
> > > I've just been having a conversation with Karel Zak to change
> > > umount(1) to use statx() so that it interacts minimally with the fs.
> >
> > So we're talking about this commit:
> > https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f
> > and the patch we discussed here:
> > https://github.com/util-linux/util-linux/issues/2049
> >
> > >
> > > In particular, nfs_getattr() skips revalidate if only minimal attrs
> > > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if
> > > locally-cached attrs are still valid (STATX_MODE), so this will
> > > avoid yet one more place that unmount can hang.
> > >
> > > In theory, vfs_getattr() could get all of these attributes directly
> > > from the vfs_inode in the unmount case.
> >
> > We don't really need that. As pointed out in that discussion and as
> > Karel did we just want to pass AT_STATX_DONT_SYNC.
> >
> > An api that would allow unmounting by mount id can safely check and
> > retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID
> > without ever syncing with the server.
>
> Unfortunately, I don't think that flag trickles down to the ->permission
> checks for the pathwalk down to the point you're stat'ing. That turns
> out to be a big part of the problem when trying to umount things that
> are stuck down in inaccessible trees.
>
> I'm not opposed at all to new APIs that allow for more reliable
> unmounting. I think that's a good idea, and something worth hashing out.
>
> But...we're stuck with umount() in perpetuity. Even if you were to merge
> a new API for unmounting today, it'll be a decade or more until the
> ecosystem catches up. I think we have a responsibility to make the
> umount syscall work as well as we can. In this case, that means giving
> it as light a footprint as possible on the pathwalk down.
>
> If we need to gate this behavior behind existing or new flags to
> umount2(), then so be it, but lets not dismiss this idea in favor of
> some new unmounting interface that may never materialize. Improving
> umount has the potential to help a lot of our users.

A new flag for an old system call or a new system call doesn't matter in
practice for userspace. And the users that have that specific problem
that's solved by a new interface will use that interface asap. That's
been true for the pidfd api, that's been true for openat2(), clone3()
and others.

Plus, this is a workload specific problem.

And even with or without a new flag it'd still need to be backported to
old kernels.

And just because an interface already exists doesn't mean stuff should
be shoehorned into it just because it's convenient or faster. That's
just asking for maintenance pain down the road.

And from this discussion there's multiple ways to work around this issue
currently so I especially don't see the need to rush any of this and
fiddle around with permission checking.

2023-04-20 21:40:51

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote:

> MNT_FORCE is, I think, a good idea and a needed functionality that has
> never been implemented well.
> MNT_FORCE causes nfs_umount_begin to be called as you noted, which
> aborts all pending RPCs on that filesystem.

Suppose it happens to be mounted in another namespace as well. Or bound
at different mountpoint, for that matter. What should MNT_FORCE do?

2023-04-20 22:07:15

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Fri, 21 Apr 2023, Al Viro wrote:
> On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote:
>
> > MNT_FORCE is, I think, a good idea and a needed functionality that has
> > never been implemented well.
> > MNT_FORCE causes nfs_umount_begin to be called as you noted, which
> > aborts all pending RPCs on that filesystem.
>
> Suppose it happens to be mounted in another namespace as well. Or bound
> at different mountpoint, for that matter. What should MNT_FORCE do?
>

1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall
that uses the vfsmount (whether from an fd, or found in a path walk,
or elsewhere), except for close(), to abort with an error;
2/ call ->umount_begin passing in the vfs_mount. The fs can abort any
outstanding transaction on any fd from that vfs_mount. Possibly
it might instead abort a wait rather than the whole transaction,
particularly if requests using some other vfs_mount might also be
interested in the transaction
3/ ->close() on a force-unmount vfs_mount would clean up without
blocking indefinitely, discarding dirty data if necessary.

This still depends on the application to close all fds that return
errors (and to chdir out of a problematic directory). But at least it
*allows* applications to do that without requiring that they be killed.

Thanks,
NeilBrown

2023-04-20 22:32:16

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible.

On Fri, Apr 21, 2023 at 08:01:09AM +1000, NeilBrown wrote:
> On Fri, 21 Apr 2023, Al Viro wrote:
> > On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote:
> >
> > > MNT_FORCE is, I think, a good idea and a needed functionality that has
> > > never been implemented well.
> > > MNT_FORCE causes nfs_umount_begin to be called as you noted, which
> > > aborts all pending RPCs on that filesystem.
> >
> > Suppose it happens to be mounted in another namespace as well. Or bound
> > at different mountpoint, for that matter. What should MNT_FORCE do?
> >
>
> 1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall
> that uses the vfsmount (whether from an fd, or found in a path walk,
> or elsewhere), except for close(), to abort with an error;
> 2/ call ->umount_begin passing in the vfs_mount. The fs can abort any
> outstanding transaction on any fd from that vfs_mount.

How the hell would you recognize those?