LinuxLists.cc - RFC: umount() fails on an NFS mount ("/A/B") mounted on an underlying ("/A") NFS mount when access is removed to the underlying mount

2023-04-06 13:34:43

Subject: RFC: umount() fails on an NFS mount ("/A/B") mounted on an underlying ("/A") NFS mount when access is removed to the underlying mount

Anna, Trond, Jeff, Neil, others,

This is a RFC post regarding a problem report from a customer. I am
not sure this is worth fixing though (look towards the end for a few
approaches), and maybe a more experienced NFS developer would have
seen this before and have guidance if this is something worth
pursuing? I think this is likely to have been a long outstanding
issue - the reproducer below works on upstream 6.3-rc5, our RHEL8
series kernels (4.18 based), as well as RHEL7 series (3.10 based)
kernels.

We had a customer report a failure to remove an autofs mountpoint that
was mounted on an underlying NFS mount whose access changed on the NFS
server. Unfortunately the customer was unable to provide the exact
steps of what changed, and was unable to provide a full tcpdump or
kernel trace. However, they did provide the specific 'umount' output
and error, which was unusual (see below), as well as output such as
exports and paths to the underlying mounts. Based on the information
the customer provided, we investigated and were able to replicate the
unique umount error, with a simplified reproducer that does not
involve autofs. The reproducer is attached and output is as follows:
# ./test-non-autofs.sh
setting exports available
exporting 127.0.0.1:/exports/dir1
exporting 127.0.0.1:/exports
setting exports unavailable
exporting 1.2.3.4:/exports/dir1
exporting 1.2.3.4:/exports
sleeping 60s to let attribute cache expire
ls: cannot access '/mnt/exports/dir1': Permission denied
umount.nfs4: /mnt/exports/dir1: block devices not permitted on fs
TEST FAIL on 6.3.0-rc5-bz2149406+
output of 'grep 127.0.0.1 /proc/mounts'
127.0.0.1:/ /mnt/exports nfs4
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1
0 0
127.0.0.1:/dir1 /mnt/exports/dir1 nfs4
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1
0 0
exporting 127.0.0.1:/exports/dir1
exporting 127.0.0.1:/exports
#

From ftracing the reproducer (attached), the reason the umount
systemcall fails is due to link_path_walk() failing, which is due to
access being removed on the dependent mountpoint and the NFS attribute
cache expiring. The link_path_walk() will call nfs_permission(), and
ultimately nfs_revalidate_inode() because the attributes have expired.
The nfs_revalidate_inode() will fail due to GETATTR call getting an
NFS server response of AUTH_ERR, which gets sent back up to
user_path_at() with -EACCESS. Since user_path_at() fails, the system
call never gets to path_umount() and thus umount fails.

1915 static int ksys_umount(char __user *name, int flags)
1916 {
1917 int lookup_flags = LOOKUP_MOUNTPOINT;
1918 struct path path;
1919 int ret;
1920
1921 // basic validity checks done first
1922 if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE |
UMOUNT_NOFOLLOW))
1923 return -EINVAL;
1924
1925 if (!(flags & UMOUNT_NOFOLLOW))
1926 lookup_flags |= LOOKUP_FOLLOW;
1927 ret = user_path_at(AT_FDCWD, name, lookup_flags, &path);
<--- this fails with -13 (EACCESS)
1928 if (ret)
1929 return ret;
1930 return path_umount(&path, flags);
1931 }
1932
1933 SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
1934 {
1935 return ksys_umount(name, flags);
1936 }

To fix this, I initially tried to add/reuse an existing mount flag
just as a proof of concept, and fail "user_path_at()" with a special
error code if the flag was set, and continue to path_umount().
However, this approach does not work (kernel oops) because
path_umount() requires a valid path structure, and so user_path_at()
must be successful.

The only other approach I thought about is to somehow pass down the
umount flag all the way down to the NFS layer to nfs_permission(), and
then essentially have nfs_permission() skip over the call to
nfs_revalidate_inode() - the flag would essentially say "there is a
umount in progress, act as though the attributes have not expired on
this directory inode and skip over refreshing". Unfortunately it
looks like I'll need multiple flags at different layers, one that
controls lookup in the VFS layer, and one down to the NFS layer. It's
possible some flags such as LOOKUP_MOUNTPOINT may be re-used, but I'm
not very optimistic about the idea of patching the VFS layer for such
a problem. So in the end, I'm not sure if this is worthwhile, and
would like some feedback on the above before investigating further.

Note that this is the same "cascading mount" configuration that was
reported recently in another recent thread [1] and described in the
patch [2] that fixed a similar problem:
[1] https://lore.kernel.org/linux-nfs/CACH9xG8-tEtWstUVmD9eZFEEAqx-E8Gs14wDL+=uNtBK=-KJvQ@mail.gmail.com/
[2] https://github.com/torvalds/linux/commit/cc89684c9a265828ce061037f1f79f4a68ccd3f7

Attachments:

ftrace-output-6.3.0-rc5-bz2149406+.txt (22.88 kB)
test-non-autofs.sh (2.00 kB)
Download all attachments

2023-04-12 13:40:59

by Jeffrey Layton

[permalink] [raw]

Subject: Re: RFC: umount() fails on an NFS mount ("/A/B") mounted on an underlying ("/A") NFS mount when access is removed to the underlying mount

On Thu, 2023-04-06 at 09:22 -0400, David Wysochanski wrote:
> Anna, Trond, Jeff, Neil, others,
>
> This is a RFC post regarding a problem report from a customer. I am
> not sure this is worth fixing though (look towards the end for a few
> approaches), and maybe a more experienced NFS developer would have
> seen this before and have guidance if this is something worth
> pursuing? I think this is likely to have been a long outstanding
> issue - the reproducer below works on upstream 6.3-rc5, our RHEL8
> series kernels (4.18 based), as well as RHEL7 series (3.10 based)
> kernels.
>
> We had a customer report a failure to remove an autofs mountpoint that
> was mounted on an underlying NFS mount whose access changed on the NFS
> server. Unfortunately the customer was unable to provide the exact
> steps of what changed, and was unable to provide a full tcpdump or
> kernel trace. However, they did provide the specific 'umount' output
> and error, which was unusual (see below), as well as output such as
> exports and paths to the underlying mounts. Based on the information
> the customer provided, we investigated and were able to replicate the
> unique umount error, with a simplified reproducer that does not
> involve autofs. The reproducer is attached and output is as follows:
> # ./test-non-autofs.sh
> setting exports available
> exporting 127.0.0.1:/exports/dir1
> exporting 127.0.0.1:/exports
> setting exports unavailable
> exporting 1.2.3.4:/exports/dir1
> exporting 1.2.3.4:/exports
> sleeping 60s to let attribute cache expire
> ls: cannot access '/mnt/exports/dir1': Permission denied
> umount.nfs4: /mnt/exports/dir1: block devices not permitted on fs
> TEST FAIL on 6.3.0-rc5-bz2149406+
> output of 'grep 127.0.0.1 /proc/mounts'
> 127.0.0.1:/ /mnt/exports nfs4
> rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1
> 0 0
> 127.0.0.1:/dir1 /mnt/exports/dir1 nfs4
> rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1
> 0 0
> exporting 127.0.0.1:/exports/dir1
> exporting 127.0.0.1:/exports
> #
>
> From ftracing the reproducer (attached), the reason the umount
> systemcall fails is due to link_path_walk() failing, which is due to
> access being removed on the dependent mountpoint and the NFS attribute
> cache expiring. The link_path_walk() will call nfs_permission(), and
> ultimately nfs_revalidate_inode() because the attributes have expired.
> The nfs_revalidate_inode() will fail due to GETATTR call getting an
> NFS server response of AUTH_ERR, which gets sent back up to
> user_path_at() with -EACCESS. Since user_path_at() fails, the system
> call never gets to path_umount() and thus umount fails.
>
> 1915 static int ksys_umount(char __user *name, int flags)
> 1916 {
> 1917 int lookup_flags = LOOKUP_MOUNTPOINT;
> 1918 struct path path;
> 1919 int ret;
> 1920
> 1921 // basic validity checks done first
> 1922 if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE |
> UMOUNT_NOFOLLOW))
> 1923 return -EINVAL;
> 1924
> 1925 if (!(flags & UMOUNT_NOFOLLOW))
> 1926 lookup_flags |= LOOKUP_FOLLOW;
> 1927 ret = user_path_at(AT_FDCWD, name, lookup_flags, &path);
> <--- this fails with -13 (EACCESS)
> 1928 if (ret)
> 1929 return ret;
> 1930 return path_umount(&path, flags);
> 1931 }
> 1932
> 1933 SYSCALL_DEFINE2(umount, char __user *, name, int, flags)
> 1934 {
> 1935 return ksys_umount(name, flags);
> 1936 }
>
> To fix this, I initially tried to add/reuse an existing mount flag
> just as a proof of concept, and fail "user_path_at()" with a special
> error code if the flag was set, and continue to path_umount().
> However, this approach does not work (kernel oops) because
> path_umount() requires a valid path structure, and so user_path_at()
> must be successful.
>
> The only other approach I thought about is to somehow pass down the
> umount flag all the way down to the NFS layer to nfs_permission(), and
> then essentially have nfs_permission() skip over the call to
> nfs_revalidate_inode() - the flag would essentially say "there is a
> umount in progress, act as though the attributes have not expired on
> this directory inode and skip over refreshing". Unfortunately it
> looks like I'll need multiple flags at different layers, one that
> controls lookup in the VFS layer, and one down to the NFS layer. It's
> possible some flags such as LOOKUP_MOUNTPOINT may be re-used, but I'm
> not very optimistic about the idea of patching the VFS layer for such
> a problem. So in the end, I'm not sure if this is worthwhile, and
> would like some feedback on the above before investigating further.
>
> Note that this is the same "cascading mount" configuration that was
> reported recently in another recent thread [1] and described in the
> patch [2] that fixed a similar problem:
> [1] https://lore.kernel.org/linux-nfs/CACH9xG8-tEtWstUVmD9eZFEEAqx-E8Gs14wDL+=uNtBK=-KJvQ@mail.gmail.com/
> [2] https://github.com/torvalds/linux/commit/cc89684c9a265828ce061037f1f79f4a68ccd3f7

Yeah, that's an ugly problem, alright.

Several years ago, we did some work [1] that made umount() skip
revalidation of the leaf dentry. The fs/namei.c code has changed a lot
since then, but I think that the exception still doesn't apply to non-
leaf dentries. They still get revalidated as usual on umount, and of
course are subject to permission checks during the pathwalk.

At the time, the rationale for skipping revalidation of the leaf was
that umount is really a special case, and that all we care about is the
final path component on the local machine. We're disposing of it anyway,
so we really don't care about revalidating the inode at all.

Maybe we could extend that argument to cover all of the path components
during umount? With an umount, what we're really interested in is the
local directory tree at the time that it's called, and any revalidation
of the path at that time is not adding any value.

It _might_ even be reasonable to avoid permission checks on directories
during umount too? Umount is already a privileged operation, so I'm not
sure we gain anything from doing NFS permission checks on intermediate
directories. Again, we're interested in the local directory tree with
umount.

How we'd go about implementing this, I'm not sure though. Maybe we
should widen this discussion to include linux-fsdevel? This might even
be good LSF/MM discussion fodder.

[1]: 8033426e6bdb vfs: allow umount to handle mountpoints without revalidating them
--
Jeff Layton <[email protected]>