2020-11-28 22:25:53

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v3 00/38] idmapped mounts

Hey everyone,

/* v3 */
- The major change is the port of the test-suite from the
kernel-internal selftests framework to xfstests as requested by
Darrick and Christoph. The test-suite for xfstests is patch 38 in this
series. It has been kept as part of this series even though it belongs
to xfstests so it's easier to see what is tested and to keep it
in-sync.
- Note, the test-suite now has been extended to cover io_uring and
idmapped mounts. The IORING_REGISTER_PERSONALITY feature allows to
register the caller's credentials with io_uring and returns an id
associated with these credentials. This is useful for applications
that wish to share a ring between separate users/processes. Callers
can pass in the credential id in the sqe personality field. If set,
that particular sqe will be issued with these credentials.
The test-suite now tests that the openat* operations with different
registered credentials work correctly and safely on regular mounts, on
regular mounts inside user namespaces, on idmapped mounts, and on
idmapped mounts inside user namespaces.

/* v2 */
- The major change is the rework requested by Christoph and others to
adapt all relevant helpers and inode_operations methods to account for
idmapped mounts instead of introducing new helpers and methods
specific to idmapped mounts like we did before. We've also moved the
overlayfs conversion to handle idmapped mounts into a separate
patchset that will be sent out separately after the core changes
landed. The converted filesytems in this series include fat and ext4.
As per Christoph's request the vfs-wide config option to disable
idmapped mounts has been removed. Instead the filesystems can decide
whether or not they want to allow idmap mounts through a config
option. These config options default to off. Having a config option
allows us to gain some confidence in the patchset over multiple kernel
releases.
- This version introduces a large test-suite to test current vfs
behavior and idmapped mounts behavior. This test-suite is intended to
grow over time.
- While while working on adapting this patchset to the requested
changes, the runC and containerd crowd was nice enough to adapt
containerd to this patchset to make use of idmapped mounts in one of
the most widely used container runtimes:
https://github.com/containerd/containerd/pull/4734

With this patchset we make it possible to attach idmappings to mounts.
This handles several common use-cases. Here are just a few:
- Shifting of a container rootfs or base image without having to mangle
every file (runc, Docker, containerd, k8s, LXD, systemd ...)
- Sharing of data between host or privileged containers with
underprivileged containers (runc, Docker, containerd, k8s, LXD, ...)
- Shifting of subset of ownership-less filesystems (vfat) for use by
multiple users, effectively allowing for DAC on such devices (systemd,
Android, ...)
- Data sharing between multiple user namespaces with incompatible maps
(LXD, k8s, ...)
Making it possible to share directories and mounts between users with
different uids and gids is itself quite an important use-case in
distributed systems environments. It's of course especially useful in
general for portable usb sticks, sharing data between multiple users in
general, and sharing home directories between multiple users. The last
example is now elegantly expressed in systemd's homed concept for
portable home directories. As mentioned above, idmapped mounts also
allow data from the host to be shared with unprivileged containers,
between privileged and unprivileged containers simultaneously and in
addition also between unprivileged containers with different idmappings
whenever they are used to isolate one container completely from another
container.
As can be seen from answers to earlier threads of this patchset and from
the list of potential users interest in this patchset is widespread.

We have implemented and proposed multiple solutions to this before. This
included the introduction of fsid mappings, a tiny filesystem that is
currently carried in Ubuntu that has shown it's limitations, and an
approach to call override creds in the vfs. None of these solutions have
covered all of the above use-cases. Some of them have been fairly hacky
too by e.g. violating how things should be passed down to the individual
filesystems.
The solution proposed here has it's origins in multiple discussions
during Linux Plumbers 2017 during and after the end of the containers
microconference.
To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
James, and myself. The original idea or a variant thereof has been
discussed, again to the best of my knowledge, after a Linux conference
in St. Petersburg in Russia in 2017 between Christoph, Tycho, and
myself.
We've taken the time to implement a working version of this solution
over the last weeks to the best of my abilities. Tycho has signed up
for this sligthly crazy endeavour as well and he has helped with the
conversion of the xattr codepaths and will be involved with others in
converting additional filesystems.

Idmappings become a property of struct vfsmount instead of tying it to a
process being inside of a user namespace which has been the case for all
other proposed approaches. It also allows to pass down the user
namespace into the filesystems which is a clean way instead of violating
calling conventions by strapping the user namespace information that is
a property of the mount to the caller's credentials or similar hacks.
Each mount can have a separate idmapping and idmapped mounts can even be
created in the initial user namespace unblocking a range of use-cases.

To this end the vfsmount struct gains a new struct user_namespace
member. The idmapping of the user namespace becomes the idmapping of the
mount. A caller that is privileged with respect to the user namespace of
the superblock of the underlying filesystem can create an idmapped
mount. In the future, we can enable unprivileged use-cases by checking
whether the caller is privileged wrt to the user namespace than an
already idmapped mount has been marked with, allowing them to change the
idmapping. For now, keep things simple until we feel sure enough.
Note, that with syscall interception it is already possible to intercept
idmapped mount requests from unprivileged containers and handle them in
a sufficiently privileged container manager. Support for this is already
available in LXD and will be available in runC were syscall interception
is currently in the process of becoming part of the runtime spec:
https://github.com/opencontainers/runtime-spec/pull/1074.

The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an argument
to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. By default vfsmounts are marked with the initial
user namespace and no behavioral or performance changes should be
observed. All mapping operations are nops for the initial user
namespace. When a file/inode is accessed through an idmapped mount the
i_uid and i_gid of the inode will be remapped according to the user
namespace the mount has been marked with.

In order to support idmapped mounts, filesystems need to be changed and
mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The initial
version contains fat and ext4 including a list of examples. But patches
for other filesystems are actively worked on but will be sent out
separately. We are here to see this through and there are multiple
people involved in converting filesystems. So filesystem developers are
not left alone with this.

There is a simple tool available at
https://github.com/brauner/mount-idmapped that allows to create idmapped
mounts so people can play with this patch series. Here are a few
illustrations:

1. Create a simple idmapped mount of another user's home directory

u1001@f2-vm:/$ sudo ./mount-idmapped --map-mount b:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--

2. Create mapping of the whole ext4 rootfs without a mapping for uid and gid 0

ubuntu@f2-vm:~$ sudo /mount-idmapped --map-mount b:1:1:65536 / /mnt/
ubuntu@f2-vm:~$ findmnt | grep mnt
└─/mnt /dev/sda2 ext4 rw,relatime
└─/mnt/mnt /dev/sda2 ext4 rw,relatime
ubuntu@f2-vm:~$ sudo mkdir /AS-ROOT-CAN-CREATE
ubuntu@f2-vm:~$ sudo mkdir /mnt/AS-ROOT-CANT-CREATE
mkdir: cannot create directory ‘/mnt/AS-ROOT-CANT-CREATE’: Value too large for defined data type
ubuntu@f2-vm:~$ mkdir /mnt/home/ubuntu/AS-USER-1000-CAN-CREATE

3. Create a vfat usb mount and expose to user 1001 and 5000

ubuntu@f2-vm:/$ sudo mount /dev/sdb /mnt
ubuntu@f2-vm:/$ findmnt | grep mnt
└─/mnt /dev/sdb vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
ubuntu@f2-vm:/$ ls -al /mnt
total 12
drwxr-xr-x 2 root root 4096 Jan 1 1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x 1 root root 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 root root 0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:1001:1 /mnt /mnt-1001/
ubuntu@f2-vm:/$ ls -al /mnt-1001/
total 12
drwxr-xr-x 2 u1001 u1001 4096 Jan 1 1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x 1 u1001 u1001 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 u1001 u1001 0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:5000:1 /mnt /mnt-5000/
ubuntu@f2-vm:/$ ls -al /mnt-5000/
total 12
drwxr-xr-x 2 5000 5000 4096 Jan 1 1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x 1 5000 5000 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 5000 5000 0 Oct 28 01:09 bbb

4. Create an idmapped rootfs mount for a container

root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/
total 68
drwxr-xr-x 17 20000 20000 4096 Sep 24 07:48 .
drwxrwx--- 3 20000 20000 4096 Oct 16 19:26 ..
lrwxrwxrwx 1 20000 20000 7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x 2 20000 20000 4096 Apr 15 2020 boot
drwxr-xr-x 3 20000 20000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 20000 20000 4096 Oct 16 19:26 etc
drwxr-xr-x 3 20000 20000 4096 Sep 24 07:45 home
lrwxrwxrwx 1 20000 20000 7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx 1 20000 20000 9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx 1 20000 20000 9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx 1 20000 20000 10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 media
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 mnt
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 opt
drwxr-xr-x 2 20000 20000 4096 Apr 15 2020 proc
drwx------ 2 20000 20000 4096 Sep 24 07:43 root
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:45 run
lrwxrwxrwx 1 20000 20000 8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 srv
drwxr-xr-x 2 20000 20000 4096 Apr 15 2020 sys
drwxrwxrwt 2 20000 20000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 20000 20000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 20000 20000 4096 Sep 24 07:44 var
root@f2-vm:~# /mount-idmapped --map-mount b:20000:10000:100000 /var/lib/lxc/f2/rootfs/ /mnt
root@f2-vm:~# ls -al /mnt
total 68
drwxr-xr-x 17 10000 10000 4096 Sep 24 07:48 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
lrwxrwxrwx 1 10000 10000 7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x 2 10000 10000 4096 Apr 15 2020 boot
drwxr-xr-x 3 10000 10000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 10000 10000 4096 Oct 16 19:26 etc
drwxr-xr-x 3 10000 10000 4096 Sep 24 07:45 home
lrwxrwxrwx 1 10000 10000 7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx 1 10000 10000 9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx 1 10000 10000 9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx 1 10000 10000 10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 media
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 mnt
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 opt
drwxr-xr-x 2 10000 10000 4096 Apr 15 2020 proc
drwx------ 2 10000 10000 4096 Sep 24 07:43 root
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:45 run
lrwxrwxrwx 1 10000 10000 8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 srv
drwxr-xr-x 2 10000 10000 4096 Apr 15 2020 sys
drwxrwxrwt 2 10000 10000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 10000 10000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 10000 10000 4096 Sep 24 07:44 var
root@f2-vm:~# lxc-start f2 # uses /mnt as rootfs
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/uid_map
0 10000 10000
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/gid_map
0 10000 10000
root@f2-vm:~# lxc-attach f2 -- ls -al /
total 52
drwxr-xr-x 17 root root 4096 Sep 24 07:48 .
drwxr-xr-x 17 root root 4096 Sep 24 07:48 ..
lrwxrwxrwx 1 root root 7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x 2 root root 4096 Apr 15 2020 boot
drwxr-xr-x 5 root root 500 Oct 28 23:39 dev
drwxr-xr-x 61 root root 4096 Oct 28 23:39 etc
drwxr-xr-x 3 root root 4096 Sep 24 07:45 home
lrwxrwxrwx 1 root root 7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx 1 root root 9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx 1 root root 10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x 2 root root 4096 Sep 24 07:43 media
drwxr-xr-x 2 root root 4096 Sep 24 07:43 mnt
drwxr-xr-x 2 root root 4096 Sep 24 07:43 opt
dr-xr-xr-x 232 nobody nogroup 0 Oct 28 23:39 proc
drwx------ 2 root root 4096 Oct 28 23:41 root
drwxr-xr-x 12 root root 360 Oct 28 23:39 run
lrwxrwxrwx 1 root root 8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x 2 root root 4096 Sep 24 07:43 srv
dr-xr-xr-x 13 nobody nogroup 0 Oct 28 23:39 sys
drwxrwxrwt 11 root root 4096 Oct 28 23:40 tmp
drwxr-xr-x 13 root root 4096 Sep 24 07:43 usr
drwxr-xr-x 12 root root 4096 Sep 24 07:44 var
root@f2-vm:~# lxc-attach f2 -- ls -al /my-file
-rw-r--r-- 1 root root 0 Oct 28 23:43 /my-file
root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/my-file
-rw-r--r-- 1 20000 20000 0 Oct 28 23:43 /var/lib/lxc/f2/rootfs/my-file

I'd like to say thanks to:
Al for pointing me into the direction to avoid inode alias issues during
lookup. David for various discussions around this. Christoph for proving
a first proper review and for being involved in the original idea. Tycho
for helping with this series and on future patches to convert
filesystems. Alban Crequy and the Kinvolk located just a few streets
away from me in Berlin for providing use-case discussions and writing
patches for containerd! Stéphane for his invaluable input on many things
and level head and enabling me to work on this. Amir for explaining and
discussing aspects of overlayfs with me. I'd like to especially thank
Seth Forshee because he provided a lot of good analysis, suggestions,
and participated in short-notice discussions in both chat and video for
some nitty-gritty technical details.

This series can be found and pulled from the three usual locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=idmapped_mounts
https://github.com/brauner/linux/tree/idmapped_mounts
https://gitlab.com/brauner/linux/-/commits/idmapped_mounts

Thanks!
Christian

Christian Brauner (36):
namespace: take lock_mount_hash() directly when changing flags
mount: make {lock,unlock}_mount_hash() static
namespace: only take read lock in do_reconfigure_mnt()
fs: add mount_setattr()
tests: add mount_setattr() selftests
fs: add id translation helpers
mount: attach mappings to mounts
capability: handle idmapped mounts
namei: add idmapped mount aware permission helpers
inode: add idmapped mount aware init and permission helpers
attr: handle idmapped mounts
acl: handle idmapped mounts
commoncap: handle idmapped mounts
stat: handle idmapped mounts
namei: handle idmapped mounts in may_*() helpers
namei: introduce struct renamedata
namei: prepare for idmapped mounts
open: handle idmapped mounts in do_truncate()
open: handle idmapped mounts
af_unix: handle idmapped mounts
utimes: handle idmapped mounts
fcntl: handle idmapped mounts
notify: handle idmapped mounts
init: handle idmapped mounts
ioctl: handle idmapped mounts
would_dump: handle idmapped mounts
exec: handle idmapped mounts
fs: add helpers for idmap mounts
apparmor: handle idmapped mounts
ima: handle idmapped mounts
fat: handle idmapped mounts
ext4: support idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
overlayfs: do not mount on top of idmapped mounts
fs: introduce MOUNT_ATTR_IDMAP
tests: extend mount_setattr tests

Tycho Andersen (1):
xattr: handle idmapped mounts

Documentation/filesystems/locking.rst | 6 +-
Documentation/filesystems/porting.rst | 2 +
Documentation/filesystems/vfs.rst | 17 +-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/platforms/cell/spufs/inode.c | 5 +-
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
drivers/android/binderfs.c | 6 +-
drivers/base/devtmpfs.c | 12 +-
fs/9p/acl.c | 7 +-
fs/9p/v9fs.h | 3 +-
fs/9p/v9fs_vfs.h | 2 +-
fs/9p/vfs_inode.c | 32 +-
fs/9p/vfs_inode_dotl.c | 34 +-
fs/9p/xattr.c | 1 +
fs/adfs/adfs.h | 3 +-
fs/adfs/inode.c | 5 +-
fs/affs/affs.h | 10 +-
fs/affs/inode.c | 7 +-
fs/affs/namei.c | 15 +-
fs/afs/dir.c | 34 +-
fs/afs/inode.c | 5 +-
fs/afs/internal.h | 4 +-
fs/afs/security.c | 2 +-
fs/afs/xattr.c | 2 +
fs/attr.c | 82 +-
fs/autofs/root.c | 13 +-
fs/bad_inode.c | 31 +-
fs/bfs/dir.c | 12 +-
fs/btrfs/acl.c | 5 +-
fs/btrfs/ctree.h | 3 +-
fs/btrfs/inode.c | 41 +-
fs/btrfs/ioctl.c | 25 +-
fs/btrfs/tests/btrfs-tests.c | 2 +-
fs/btrfs/xattr.c | 2 +
fs/cachefiles/interface.c | 4 +-
fs/cachefiles/namei.c | 19 +-
fs/cachefiles/xattr.c | 16 +-
fs/ceph/acl.c | 5 +-
fs/ceph/dir.c | 23 +-
fs/ceph/inode.c | 13 +-
fs/ceph/super.h | 8 +-
fs/ceph/xattr.c | 1 +
fs/cifs/cifsfs.c | 4 +-
fs/cifs/cifsfs.h | 18 +-
fs/cifs/dir.c | 8 +-
fs/cifs/inode.c | 22 +-
fs/cifs/link.c | 3 +-
fs/cifs/xattr.c | 1 +
fs/coda/coda_linux.h | 4 +-
fs/coda/dir.c | 18 +-
fs/coda/inode.c | 5 +-
fs/coda/pioctl.c | 6 +-
fs/configfs/configfs_internal.h | 7 +-
fs/configfs/dir.c | 3 +-
fs/configfs/inode.c | 5 +-
fs/configfs/symlink.c | 5 +-
fs/coredump.c | 12 +-
fs/crypto/policy.c | 2 +-
fs/debugfs/inode.c | 9 +-
fs/ecryptfs/crypto.c | 4 +-
fs/ecryptfs/inode.c | 74 +-
fs/ecryptfs/main.c | 6 +
fs/ecryptfs/mmap.c | 4 +-
fs/efivarfs/file.c | 2 +-
fs/efivarfs/inode.c | 4 +-
fs/erofs/inode.c | 2 +-
fs/exec.c | 12 +-
fs/exfat/exfat_fs.h | 3 +-
fs/exfat/file.c | 9 +-
fs/exfat/namei.c | 13 +-
fs/ext2/acl.c | 5 +-
fs/ext2/acl.h | 3 +-
fs/ext2/ext2.h | 2 +-
fs/ext2/ialloc.c | 2 +-
fs/ext2/inode.c | 11 +-
fs/ext2/ioctl.c | 6 +-
fs/ext2/namei.c | 22 +-
fs/ext2/xattr_security.c | 1 +
fs/ext2/xattr_trusted.c | 1 +
fs/ext2/xattr_user.c | 1 +
fs/ext4/Kconfig | 9 +
fs/ext4/acl.c | 5 +-
fs/ext4/acl.h | 3 +-
fs/ext4/ext4.h | 15 +-
fs/ext4/ialloc.c | 7 +-
fs/ext4/inode.c | 14 +-
fs/ext4/ioctl.c | 18 +-
fs/ext4/namei.c | 52 +-
fs/ext4/super.c | 6 +-
fs/ext4/xattr_hurd.c | 1 +
fs/ext4/xattr_security.c | 1 +
fs/ext4/xattr_trusted.c | 1 +
fs/ext4/xattr_user.c | 1 +
fs/f2fs/acl.c | 5 +-
fs/f2fs/acl.h | 3 +-
fs/f2fs/f2fs.h | 3 +-
fs/f2fs/file.c | 31 +-
fs/f2fs/namei.c | 26 +-
fs/f2fs/xattr.c | 4 +-
fs/fat/fat.h | 3 +-
fs/fat/file.c | 20 +-
fs/fat/namei_msdos.c | 15 +-
fs/fat/namei_vfat.c | 15 +-
fs/fcntl.c | 3 +-
fs/fuse/acl.c | 3 +-
fs/fuse/dir.c | 41 +-
fs/fuse/fuse_i.h | 4 +-
fs/fuse/xattr.c | 2 +
fs/gfs2/acl.c | 5 +-
fs/gfs2/acl.h | 3 +-
fs/gfs2/file.c | 4 +-
fs/gfs2/inode.c | 54 +-
fs/gfs2/inode.h | 2 +-
fs/gfs2/xattr.c | 1 +
fs/hfs/attr.c | 1 +
fs/hfs/dir.c | 13 +-
fs/hfs/hfs_fs.h | 2 +-
fs/hfs/inode.c | 7 +-
fs/hfsplus/dir.c | 25 +-
fs/hfsplus/inode.c | 11 +-
fs/hfsplus/ioctl.c | 2 +-
fs/hfsplus/xattr.c | 1 +
fs/hfsplus/xattr_security.c | 1 +
fs/hfsplus/xattr_trusted.c | 1 +
fs/hfsplus/xattr_user.c | 1 +
fs/hostfs/hostfs_kern.c | 31 +-
fs/hpfs/hpfs_fn.h | 2 +-
fs/hpfs/inode.c | 7 +-
fs/hpfs/namei.c | 20 +-
fs/hugetlbfs/inode.c | 31 +-
fs/init.c | 23 +-
fs/inode.c | 38 +-
fs/internal.h | 9 +
fs/jffs2/acl.c | 5 +-
fs/jffs2/acl.h | 3 +-
fs/jffs2/dir.c | 32 +-
fs/jffs2/fs.c | 7 +-
fs/jffs2/os-linux.h | 2 +-
fs/jffs2/security.c | 1 +
fs/jffs2/xattr_trusted.c | 1 +
fs/jffs2/xattr_user.c | 1 +
fs/jfs/acl.c | 5 +-
fs/jfs/file.c | 9 +-
fs/jfs/ioctl.c | 2 +-
fs/jfs/jfs_acl.h | 3 +-
fs/jfs/jfs_inode.c | 2 +-
fs/jfs/jfs_inode.h | 2 +-
fs/jfs/namei.c | 21 +-
fs/jfs/xattr.c | 2 +
fs/kernfs/dir.c | 7 +-
fs/kernfs/inode.c | 15 +-
fs/kernfs/kernfs-internal.h | 5 +-
fs/libfs.c | 20 +-
fs/minix/bitmap.c | 2 +-
fs/minix/file.c | 7 +-
fs/minix/inode.c | 2 +-
fs/minix/namei.c | 25 +-
fs/mount.h | 10 -
fs/namei.c | 303 ++--
fs/namespace.c | 478 +++++-
fs/nfs/dir.c | 23 +-
fs/nfs/inode.c | 5 +-
fs/nfs/internal.h | 10 +-
fs/nfs/namespace.c | 7 +-
fs/nfs/nfs3_fs.h | 3 +-
fs/nfs/nfs3acl.c | 3 +-
fs/nfs/nfs4proc.c | 3 +
fs/nfsd/nfs2acl.c | 4 +-
fs/nfsd/nfs3acl.c | 4 +-
fs/nfsd/nfs4acl.c | 4 +-
fs/nfsd/nfs4recover.c | 6 +-
fs/nfsd/nfsfh.c | 2 +-
fs/nfsd/nfsproc.c | 2 +-
fs/nfsd/vfs.c | 47 +-
fs/nilfs2/inode.c | 13 +-
fs/nilfs2/ioctl.c | 2 +-
fs/nilfs2/namei.c | 20 +-
fs/nilfs2/nilfs.h | 4 +-
fs/notify/fanotify/fanotify_user.c | 2 +-
fs/notify/inotify/inotify_user.c | 3 +-
fs/ntfs/inode.c | 6 +-
fs/ntfs/inode.h | 3 +-
fs/ocfs2/acl.c | 5 +-
fs/ocfs2/acl.h | 3 +-
fs/ocfs2/dlmfs/dlmfs.c | 17 +-
fs/ocfs2/file.c | 13 +-
fs/ocfs2/file.h | 5 +-
fs/ocfs2/ioctl.c | 2 +-
fs/ocfs2/namei.c | 21 +-
fs/ocfs2/refcounttree.c | 4 +-
fs/ocfs2/xattr.c | 3 +
fs/omfs/dir.c | 13 +-
fs/omfs/file.c | 7 +-
fs/omfs/inode.c | 2 +-
fs/open.c | 50 +-
fs/orangefs/acl.c | 5 +-
fs/orangefs/inode.c | 15 +-
fs/orangefs/namei.c | 12 +-
fs/orangefs/orangefs-kernel.h | 7 +-
fs/orangefs/xattr.c | 1 +
fs/overlayfs/copy_up.c | 20 +-
fs/overlayfs/dir.c | 31 +-
fs/overlayfs/file.c | 6 +-
fs/overlayfs/inode.c | 21 +-
fs/overlayfs/overlayfs.h | 40 +-
fs/overlayfs/super.c | 19 +-
fs/overlayfs/util.c | 4 +-
fs/posix_acl.c | 78 +-
fs/proc/base.c | 24 +-
fs/proc/fd.c | 4 +-
fs/proc/fd.h | 3 +-
fs/proc/generic.c | 9 +-
fs/proc/internal.h | 2 +-
fs/proc/proc_net.c | 2 +-
fs/proc/proc_sysctl.c | 12 +-
fs/proc/root.c | 2 +-
fs/proc_namespace.c | 3 +
fs/ramfs/file-nommu.c | 9 +-
fs/ramfs/inode.c | 18 +-
fs/reiserfs/acl.h | 3 +-
fs/reiserfs/inode.c | 7 +-
fs/reiserfs/ioctl.c | 4 +-
fs/reiserfs/namei.c | 21 +-
fs/reiserfs/reiserfs.h | 3 +-
fs/reiserfs/xattr.c | 12 +-
fs/reiserfs/xattr.h | 2 +-
fs/reiserfs/xattr_acl.c | 7 +-
fs/reiserfs/xattr_security.c | 3 +-
fs/reiserfs/xattr_trusted.c | 3 +-
fs/reiserfs/xattr_user.c | 3 +-
fs/remap_range.c | 7 +-
fs/stat.c | 10 +-
fs/sysv/file.c | 7 +-
fs/sysv/ialloc.c | 2 +-
fs/sysv/itree.c | 2 +-
fs/sysv/namei.c | 21 +-
fs/tracefs/inode.c | 4 +-
fs/ubifs/dir.c | 29 +-
fs/ubifs/file.c | 5 +-
fs/ubifs/ioctl.c | 2 +-
fs/ubifs/ubifs.h | 3 +-
fs/ubifs/xattr.c | 1 +
fs/udf/file.c | 9 +-
fs/udf/ialloc.c | 2 +-
fs/udf/namei.c | 24 +-
fs/udf/symlink.c | 2 +-
fs/ufs/ialloc.c | 2 +-
fs/ufs/inode.c | 7 +-
fs/ufs/namei.c | 19 +-
fs/ufs/ufs.h | 3 +-
fs/utimes.c | 4 +-
fs/vboxsf/dir.c | 12 +-
fs/vboxsf/utils.c | 5 +-
fs/vboxsf/vfsmod.h | 3 +-
fs/verity/enable.c | 2 +-
fs/xattr.c | 136 +-
fs/xfs/xfs_acl.c | 5 +-
fs/xfs/xfs_acl.h | 3 +-
fs/xfs/xfs_ioctl.c | 4 +-
fs/xfs/xfs_iops.c | 55 +-
fs/xfs/xfs_xattr.c | 3 +-
fs/zonefs/super.c | 9 +-
include/linux/capability.h | 14 +-
include/linux/fs.h | 145 +-
include/linux/ima.h | 15 +-
include/linux/lsm_hook_defs.h | 15 +-
include/linux/lsm_hooks.h | 1 +
include/linux/mount.h | 7 +
include/linux/nfs_fs.h | 4 +-
include/linux/posix_acl.h | 15 +-
include/linux/posix_acl_xattr.h | 12 +-
include/linux/security.h | 44 +-
include/linux/syscalls.h | 3 +
include/linux/xattr.h | 30 +-
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/mount.h | 27 +
ipc/mqueue.c | 8 +-
kernel/auditsc.c | 5 +-
kernel/bpf/inode.c | 13 +-
kernel/capability.c | 14 +-
kernel/cgroup/cgroup.c | 2 +-
kernel/sys.c | 2 +-
mm/madvise.c | 4 +-
mm/memcontrol.c | 2 +-
mm/mincore.c | 4 +-
mm/shmem.c | 45 +-
net/socket.c | 6 +-
net/unix/af_unix.c | 4 +-
security/apparmor/apparmorfs.c | 3 +-
security/apparmor/domain.c | 13 +-
security/apparmor/file.c | 5 +-
security/apparmor/lsm.c | 12 +-
security/commoncap.c | 46 +-
security/integrity/evm/evm_crypto.c | 11 +-
security/integrity/evm/evm_main.c | 4 +-
security/integrity/evm/evm_secfs.c | 2 +-
security/integrity/ima/ima.h | 19 +-
security/integrity/ima/ima_api.c | 10 +-
security/integrity/ima/ima_appraise.c | 22 +-
security/integrity/ima/ima_asymmetric_keys.c | 2 +-
security/integrity/ima/ima_main.c | 28 +-
security/integrity/ima/ima_policy.c | 17 +-
security/integrity/ima/ima_queue_keys.c | 2 +-
security/security.c | 25 +-
security/selinux/hooks.c | 22 +-
security/smack/smack_lsm.c | 18 +-
tools/include/uapi/asm-generic/unistd.h | 4 +-
tools/testing/selftests/Makefile | 1 +
.../selftests/mount_setattr/.gitignore | 1 +
.../testing/selftests/mount_setattr/Makefile | 7 +
tools/testing/selftests/mount_setattr/config | 1 +
.../mount_setattr/mount_setattr_test.c | 1424 +++++++++++++++++
327 files changed, 4165 insertions(+), 1551 deletions(-)
create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
create mode 100644 tools/testing/selftests/mount_setattr/Makefile
create mode 100644 tools/testing/selftests/mount_setattr/config
create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c


base-commit: 09162bc32c880a791c6c0668ce0745cf7958f576
--
2.29.2


2020-11-28 22:25:58

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v3 04/38] fs: add mount_setattr()

This implements the missing mount_setattr() syscall. While the new mount api
allows to change the properties of a superblock there is currently no way to
change the properties of a mount or a mount tree using file descriptors which
the new mount api is based on. In addition the old mount api has the restriction
that mount options cannot be applied recursively. This hasn't changed since
changing mount options on a per-mount basis was implemented in [1] and has been
a frequent request not just for convenience but also for security reasons.
The legacy mount syscall is unable to accommodate this behavior without
introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount.
Changing MS_REC to apply to the whole mount tree would mean introducing a
significant uapi change and would likely cause significant regressions.

The new mount_setattr() syscall allows to recursively clear and set mount
options in one shot. Multiple calls to change mount options requesting the same
changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags argument.
Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT
are supported. If useful, additional lookup flags to restrict path resolution as
introduced with openat2() might be supported in the future.

The mount_setattr() syscall can be expected to grow over time and is designed
with extensibility in mind. It follows the extensible syscall pattern we have
used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and
others.
The set of mount options is passed in the uapi struct mount_attr which currently
has the following layout:

struct mount_attr {
__u64 attr_set;
__u64 attr_clr;
__u32 propagation;
};

The @attr_set and @attr_clr members are used to clear and set mount options.
This way a user can e.g. request that a set of flags is to be raised such as
turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the
same time requesting that another set of flags is to be lowered such as removing
noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr.

Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a
bitmap, users wanting to transition to a different atime setting cannot simply
specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME
in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially
set in @attr_clr and that @attr_set can't have any atime bits set if
MOUNT_ATTR__ATIME isn't set in @attr_clr.

The @propagation field lets callers specify the propagation type of a mount
tree. Propagation is a single property that has four different settings and as
such is not really a flag argument but an enum. Specifically, it would be
unclear what setting and clearing propagation settings in combination would
amount to. The legacy mount() syscall thus forbids the combination of multiple
propagation settings too. The goal is to keep the semantics of mount propagation
somewhat simple as they are overly complex as it is.

[1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
Cc: Christoph Hellwig <[email protected]>
Cc: David Howells <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- Christoph Hellwig <[email protected]>:
- Split into multiple helpers.

/* v3 */
- kernel test robot <[email protected]>:
- Fix unknown __u64 type by including linux/types.h in linux/mount.h.
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/internal.h | 8 +
fs/namespace.c | 327 ++++++++++++++++++--
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/mount.h | 24 ++
tools/include/uapi/asm-generic/unistd.h | 4 +-
23 files changed, 362 insertions(+), 26 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index ee7b01bb7346..24d8709624b8 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -480,3 +480,4 @@
548 common pidfd_getfd sys_pidfd_getfd
549 common faccessat2 sys_faccessat2
550 common process_madvise sys_process_madvise
+551 common mount_setattr sys_mount_setattr
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d056a548358e..e3785513d445 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -454,3 +454,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 107f08e03b9f..78af754e070a 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -889,6 +889,8 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
__SYSCALL(__NR_faccessat2, sys_faccessat2)
#define __NR_process_madvise 440
__SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_mount_setattr 441
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)

/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index b96ed8b8a508..f7d4b1f55be0 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -361,3 +361,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 625fb6d32842..e96e9c6a6ffa 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index aae729c95cf9..6538f075a18e 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -446,3 +446,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 32817c954435..64d129db1aa7 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -379,3 +379,4 @@
438 n32 pidfd_getfd sys_pidfd_getfd
439 n32 faccessat2 sys_faccessat2
440 n32 process_madvise sys_process_madvise
+441 n32 mount_setattr sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 9e4ea3c31b1c..94b24e6b2608 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -355,3 +355,4 @@
438 n64 pidfd_getfd sys_pidfd_getfd
439 n64 faccessat2 sys_faccessat2
440 n64 process_madvise sys_process_madvise
+441 n64 mount_setattr sys_mount_setattr
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 29f5f28cf5ce..eae522306767 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -428,3 +428,4 @@
438 o32 pidfd_getfd sys_pidfd_getfd
439 o32 faccessat2 sys_faccessat2
440 o32 process_madvise sys_process_madvise
+441 o32 mount_setattr sys_mount_setattr
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index f375ea528e59..c7e25f1d219f 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 1275daec7fec..0b309ef64e91 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -530,3 +530,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 28c168000483..0b30398fee42 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr sys_mount_setattr
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 783738448ff5..8e4949c5b740 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 78160260991b..409f21a650b8 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -486,3 +486,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 0d0667a9fbd7..2a694420f6cd 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -445,3 +445,4 @@
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
440 i386 process_madvise sys_process_madvise
+441 i386 mount_setattr sys_mount_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 379819244b91..4d594d0246c1 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -362,6 +362,7 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index b070f272995d..a650dc05593d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -411,3 +411,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 common mount_setattr sys_mount_setattr
diff --git a/fs/internal.h b/fs/internal.h
index a7cd0f64faa4..a5a6c470dc07 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -82,6 +82,14 @@ int may_linkat(struct path *link);
/*
* namespace.c
*/
+struct mount_kattr {
+ unsigned int attr_set;
+ unsigned int attr_clr;
+ unsigned int propagation;
+ unsigned int lookup_flags;
+ bool recurse;
+};
+
extern struct vfsmount *lookup_mnt(const struct path *);
extern int finish_automount(struct vfsmount *, struct path *);

diff --git a/fs/namespace.c b/fs/namespace.c
index 8497d149ecaa..f9ea31b7eb7f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -469,10 +469,8 @@ void mnt_drop_write_file(struct file *file)
}
EXPORT_SYMBOL(mnt_drop_write_file);

-static int mnt_make_readonly(struct mount *mnt)
+static inline int mnt_hold_writers(struct mount *mnt)
{
- int ret = 0;
-
mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
/*
* After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -497,15 +495,29 @@ static int mnt_make_readonly(struct mount *mnt)
* we're counting up here.
*/
if (mnt_get_writers(mnt) > 0)
- ret = -EBUSY;
- else
- mnt->mnt.mnt_flags |= MNT_READONLY;
+ return -EBUSY;
+
+ return 0;
+}
+
+static inline void mnt_unhold_writers(struct mount *mnt)
+{
/*
* MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
* that become unheld will see MNT_READONLY.
*/
smp_wmb();
mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+}
+
+static int mnt_make_readonly(struct mount *mnt)
+{
+ int ret;
+
+ ret = mnt_hold_writers(mnt);
+ if (!ret)
+ mnt->mnt.mnt_flags |= MNT_READONLY;
+ mnt_unhold_writers(mnt);
return ret;
}

@@ -3438,6 +3450,33 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
return ret;
}

+static int build_attr_flags(unsigned int attr_flags, unsigned int *flags)
+{
+ unsigned int aflags = 0;
+
+ if (attr_flags & ~(MOUNT_ATTR_RDONLY |
+ MOUNT_ATTR_NOSUID |
+ MOUNT_ATTR_NODEV |
+ MOUNT_ATTR_NOEXEC |
+ MOUNT_ATTR__ATIME |
+ MOUNT_ATTR_NODIRATIME))
+ return -EINVAL;
+
+ if (attr_flags & MOUNT_ATTR_RDONLY)
+ aflags |= MNT_READONLY;
+ if (attr_flags & MOUNT_ATTR_NOSUID)
+ aflags |= MNT_NOSUID;
+ if (attr_flags & MOUNT_ATTR_NODEV)
+ aflags |= MNT_NODEV;
+ if (attr_flags & MOUNT_ATTR_NOEXEC)
+ aflags |= MNT_NOEXEC;
+ if (attr_flags & MOUNT_ATTR_NODIRATIME)
+ aflags |= MNT_NODIRATIME;
+
+ *flags = aflags;
+ return 0;
+}
+
/*
* Create a kernel mount representation for a new, prepared superblock
* (specified by fs_fd) and attach to an open_tree-like file descriptor.
@@ -3460,24 +3499,9 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags,
if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
return -EINVAL;

- if (attr_flags & ~(MOUNT_ATTR_RDONLY |
- MOUNT_ATTR_NOSUID |
- MOUNT_ATTR_NODEV |
- MOUNT_ATTR_NOEXEC |
- MOUNT_ATTR__ATIME |
- MOUNT_ATTR_NODIRATIME))
- return -EINVAL;
-
- if (attr_flags & MOUNT_ATTR_RDONLY)
- mnt_flags |= MNT_READONLY;
- if (attr_flags & MOUNT_ATTR_NOSUID)
- mnt_flags |= MNT_NOSUID;
- if (attr_flags & MOUNT_ATTR_NODEV)
- mnt_flags |= MNT_NODEV;
- if (attr_flags & MOUNT_ATTR_NOEXEC)
- mnt_flags |= MNT_NOEXEC;
- if (attr_flags & MOUNT_ATTR_NODIRATIME)
- mnt_flags |= MNT_NODIRATIME;
+ ret = build_attr_flags(attr_flags, &mnt_flags);
+ if (ret)
+ return ret;

switch (attr_flags & MOUNT_ATTR__ATIME) {
case MOUNT_ATTR_STRICTATIME:
@@ -3785,6 +3809,261 @@ SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
return error;
}

+static unsigned int recalc_flags(struct mount_kattr *kattr, struct mount *mnt)
+{
+ unsigned int flags = mnt->mnt.mnt_flags;
+
+ /* flags to clear */
+ flags &= ~kattr->attr_clr;
+ /* flags to raise */
+ flags |= kattr->attr_set;
+
+ return flags;
+}
+
+static struct mount *mount_setattr_prepare(struct mount_kattr *kattr,
+ struct mount *mnt, int *err)
+{
+ struct mount *m = mnt, *last = NULL;
+
+ if (!is_mounted(&m->mnt)) {
+ *err = -EINVAL;
+ goto out;
+ }
+
+ if (!(mnt_has_parent(m) ? check_mnt(m) : is_anon_ns(m->mnt_ns))) {
+ *err = -EINVAL;
+ goto out;
+ }
+
+ do {
+ unsigned int flags;
+
+ flags = recalc_flags(kattr, m);
+ if (!can_change_locked_flags(m, flags)) {
+ *err = -EPERM;
+ goto out;
+ }
+
+ last = m;
+
+ if ((kattr->attr_set & MNT_READONLY) &&
+ !(m->mnt.mnt_flags & MNT_READONLY)) {
+ *err = mnt_hold_writers(m);
+ if (*err)
+ goto out;
+ }
+ } while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+out:
+ return last;
+}
+
+static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt,
+ struct mount *last, int err)
+{
+ struct mount *m = mnt;
+
+ do {
+ if (!err) {
+ unsigned int flags;
+
+ flags = recalc_flags(kattr, m);
+ WRITE_ONCE(m->mnt.mnt_flags, flags);
+ }
+
+ /*
+ * We either set MNT_READONLY above so make it visible
+ * before ~MNT_WRITE_HOLD or we failed to recursively
+ * apply mount options.
+ */
+ if ((kattr->attr_set & MNT_READONLY) &&
+ (m->mnt.mnt_flags & MNT_WRITE_HOLD))
+ mnt_unhold_writers(m);
+
+ if (!err && kattr->propagation)
+ change_mnt_propagation(m, kattr->propagation);
+
+ /*
+ * On failure, only cleanup until we found the first mount we
+ * failed to handle.
+ */
+ if (err && m == last)
+ break;
+ } while (kattr->recurse && (m = next_mnt(m, mnt)));
+
+ if (!err)
+ touch_mnt_namespace(mnt->mnt_ns);
+}
+
+static int do_mount_setattr(struct path *path, struct mount_kattr *kattr)
+{
+ struct mount *mnt = real_mount(path->mnt), *last = NULL;
+ int err = 0;
+
+ if (path->dentry != mnt->mnt.mnt_root)
+ return -EINVAL;
+
+ if (kattr->propagation) {
+ /*
+ * Only take namespace_lock() if we're actually changing
+ * propagation.
+ */
+ namespace_lock();
+ if (kattr->propagation == MS_SHARED) {
+ err = invent_group_ids(mnt, kattr->recurse);
+ if (err) {
+ namespace_unlock();
+ return err;
+ }
+ }
+ }
+
+ lock_mount_hash();
+
+ /*
+ * Get the mount tree in a shape where we can change mount properties
+ * without failure.
+ */
+ last = mount_setattr_prepare(kattr, mnt, &err);
+ if (last) /* Commit all changes or revert to the old state. */
+ mount_setattr_commit(kattr, mnt, last, err);
+
+ unlock_mount_hash();
+
+ if (kattr->propagation) {
+ namespace_unlock();
+ if (err)
+ cleanup_group_ids(mnt, NULL);
+ }
+
+ return err;
+}
+
+static int build_mount_kattr(const struct mount_attr *attr,
+ struct mount_kattr *kattr, unsigned int flags)
+{
+ unsigned int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+
+ if (flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+ *kattr = (struct mount_kattr){
+ .lookup_flags = lookup_flags,
+ .recurse = !!(flags & AT_RECURSIVE),
+ };
+
+ switch (attr->propagation) {
+ case MAKE_PROPAGATION_UNCHANGED:
+ kattr->propagation = 0;
+ break;
+ case MAKE_PROPAGATION_UNBINDABLE:
+ kattr->propagation = MS_UNBINDABLE;
+ break;
+ case MAKE_PROPAGATION_PRIVATE:
+ kattr->propagation = MS_PRIVATE;
+ break;
+ case MAKE_PROPAGATION_DEPENDENT:
+ kattr->propagation = MS_SLAVE;
+ break;
+ case MAKE_PROPAGATION_SHARED:
+ kattr->propagation = MS_SHARED;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (upper_32_bits(attr->attr_set))
+ return -EINVAL;
+ if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
+ return -EINVAL;
+
+ if (upper_32_bits(attr->attr_clr))
+ return -EINVAL;
+ if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
+ return -EINVAL;
+
+ /*
+ * Since the MOUNT_ATTR_<atime> values are an enum, not a bitmap, users
+ * wanting to transition to a different atime setting cannot simply
+ * specify the atime setting in @attr_set, but must also specify
+ * MOUNT_ATTR__ATIME in the @attr_clr field.
+ * So ensure that MOUNT_ATTR__ATIME can't be partially set in
+ * @attr_clr and that @attr_set can't have any atime bits set if
+ * MOUNT_ATTR__ATIME isn't set in @attr_clr.
+ */
+ if (!(attr->attr_clr & MOUNT_ATTR__ATIME) && (attr->attr_set & MOUNT_ATTR__ATIME))
+ return -EINVAL;
+ else if ((attr->attr_clr & MOUNT_ATTR__ATIME) &&
+ ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME))
+ return -EINVAL;
+
+ if (attr->attr_clr & MOUNT_ATTR__ATIME) {
+ /* Clear all previous time settings as they are mutually exclusive. */
+ kattr->attr_clr |= MNT_RELATIME | MNT_NOATIME;
+ switch (attr->attr_set & MOUNT_ATTR__ATIME) {
+ case MOUNT_ATTR_RELATIME:
+ kattr->attr_set |= MNT_RELATIME;
+ break;
+ case MOUNT_ATTR_NOATIME:
+ kattr->attr_set |= MNT_NOATIME;
+ break;
+ case MOUNT_ATTR_STRICTATIME:
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+SYSCALL_DEFINE5(mount_setattr, int, dfd, const char __user *, path, unsigned int, flags,
+ struct mount_attr __user *, uattr, size_t, usize)
+{
+ int err;
+ struct path target;
+ struct mount_attr attr;
+ struct mount_kattr kattr;
+
+ BUILD_BUG_ON(sizeof(struct mount_attr) < MOUNT_ATTR_SIZE_VER0);
+ BUILD_BUG_ON(sizeof(struct mount_attr) != MOUNT_ATTR_SIZE_LATEST);
+
+ if (flags & ~(AT_EMPTY_PATH | AT_RECURSIVE | AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT))
+ return -EINVAL;
+
+ if (unlikely(usize > PAGE_SIZE))
+ return -E2BIG;
+ if (unlikely(usize < MOUNT_ATTR_SIZE_VER0))
+ return -EINVAL;
+
+ if (!may_mount())
+ return -EPERM;
+
+ err = copy_struct_from_user(&attr, sizeof(attr), uattr, usize);
+ if (err)
+ return err;
+
+ if (attr.attr_set == 0 && attr.attr_clr == 0 && attr.propagation == 0)
+ return 0;
+
+ err = build_mount_kattr(&attr, &kattr, flags);
+ if (err)
+ return err;
+
+ err = user_path_at(dfd, path, kattr.lookup_flags, &target);
+ if (err)
+ return err;
+
+ err = do_mount_setattr(&target, &kattr);
+ path_put(&target);
+ return err;
+}
+
static void __init init_mount_tree(void)
{
struct vfsmount *mnt;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 37bea07c12f2..a62d5904fb6a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -68,6 +68,7 @@ union bpf_attr;
struct io_uring_params;
struct clone_args;
struct open_how;
+struct mount_attr;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -999,6 +1000,8 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
int to_dfd, const char __user *to_path,
unsigned int ms_flags);
+asmlinkage long sys_mount_setattr(int dfd, const char __user *path, unsigned int flags,
+ struct mount_attr __user *uattr, size_t usize);
asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
const void __user *value, int aux);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 2056318988f7..0517f36fe783 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
__SYSCALL(__NR_faccessat2, sys_faccessat2)
#define __NR_process_madvise 440
__SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_mount_setattr 441
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)

#undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442

/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index dd8306ea336c..fba42b4bfc1c 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -1,6 +1,8 @@
#ifndef _UAPI_LINUX_MOUNT_H
#define _UAPI_LINUX_MOUNT_H

+#include <linux/types.h>
+
/*
* These are the fs-independent mount-flags: up to 32 flags are supported
*
@@ -118,4 +120,26 @@ enum fsconfig_command {
#define MOUNT_ATTR_STRICTATIME 0x00000020 /* - Always perform atime updates */
#define MOUNT_ATTR_NODIRATIME 0x00000080 /* Do not update directory access times */

+/*
+ * mount_setattr()
+ */
+struct mount_attr {
+ __u64 attr_set;
+ __u64 attr_clr;
+ __u64 propagation;
+};
+
+/* Change propagation through mount_setattr(). */
+enum propagation_type {
+ MAKE_PROPAGATION_UNCHANGED = 0, /* Don't change mount propagation (default). */
+ MAKE_PROPAGATION_UNBINDABLE = 1, /* Make unbindable. */
+ MAKE_PROPAGATION_PRIVATE = 2, /* Do not receive or send mount events. */
+ MAKE_PROPAGATION_DEPENDENT = 3, /* Only receive mount events. */
+ MAKE_PROPAGATION_SHARED = 4, /* Send and receive mount events. */
+};
+
+/* List of all mount_attr versions. */
+#define MOUNT_ATTR_SIZE_VER0 24 /* sizeof first published struct */
+#define MOUNT_ATTR_SIZE_LATEST MOUNT_ATTR_SIZE_VER0
+
#endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/tools/include/uapi/asm-generic/unistd.h b/tools/include/uapi/asm-generic/unistd.h
index 2056318988f7..0517f36fe783 100644
--- a/tools/include/uapi/asm-generic/unistd.h
+++ b/tools/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
__SYSCALL(__NR_faccessat2, sys_faccessat2)
#define __NR_process_madvise 440
__SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_mount_setattr 441
+__SYSCALL(__NR_mount_setattr, sys_mount_setattr)

#undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442

/*
* 32 bit systems traditionally used different
--
2.29.2

2020-11-28 22:26:00

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v3 05/38] tests: add mount_setattr() selftests

Add a range of selftests for the new mount_setattr() syscall to verify
that it works as expected. This tests that:
- no invalid flags can be specified
- changing properties of a single mount works and leaves other mounts in
the mount tree unchanged
- changing a mount tre to read-only when one of the mounts has writers
fails and leaves the whole mount tree unchanged
- changing mount properties from multiple threads works
- changing atime settings works
- changing mount propagation works
- changing the mount options of a mount tree where the individual mounts
in the tree have different mount options only changes the flags that
were requested to change
- changing mount options from another mount namespace fails
- changing mount options from another user namespace fails

[==========] Running 9 tests from 2 test cases.
[ RUN ] mount_setattr.invalid_attributes
[ OK ] mount_setattr.invalid_attributes
[ RUN ] mount_setattr.basic
[ OK ] mount_setattr.basic
[ RUN ] mount_setattr.basic_recursive
[ OK ] mount_setattr.basic_recursive
[ RUN ] mount_setattr.mount_has_writers
[ OK ] mount_setattr.mount_has_writers
[ RUN ] mount_setattr.mixed_mount_options
[ OK ] mount_setattr.mixed_mount_options
[ RUN ] mount_setattr.time_changes
[ OK ] mount_setattr.time_changes
[ RUN ] mount_setattr.multi_threaded
[ OK ] mount_setattr.multi_threaded
[ RUN ] mount_setattr.wrong_user_namespace
[ OK ] mount_setattr.wrong_user_namespace
[ RUN ] mount_setattr.wrong_mount_namespace
[ OK ] mount_setattr.wrong_mount_namespace
[==========] 9 / 9 tests passed.
[ PASSED ]

Cc: Christoph Hellwig <[email protected]>
Cc: David Howells <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
tools/testing/selftests/Makefile | 1 +
.../selftests/mount_setattr/.gitignore | 1 +
.../testing/selftests/mount_setattr/Makefile | 7 +
tools/testing/selftests/mount_setattr/config | 1 +
.../mount_setattr/mount_setattr_test.c | 941 ++++++++++++++++++
5 files changed, 951 insertions(+)
create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
create mode 100644 tools/testing/selftests/mount_setattr/Makefile
create mode 100644 tools/testing/selftests/mount_setattr/config
create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index d9c283503159..87b7107dd9a6 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -34,6 +34,7 @@ TARGETS += memfd
TARGETS += memory-hotplug
TARGETS += mincore
TARGETS += mount
+TARGETS += mount_setattr
TARGETS += mqueue
TARGETS += net
TARGETS += net/forwarding
diff --git a/tools/testing/selftests/mount_setattr/.gitignore b/tools/testing/selftests/mount_setattr/.gitignore
new file mode 100644
index 000000000000..5f74d8488472
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/.gitignore
@@ -0,0 +1 @@
+mount_setattr_test
diff --git a/tools/testing/selftests/mount_setattr/Makefile b/tools/testing/selftests/mount_setattr/Makefile
new file mode 100644
index 000000000000..2250f7dcb81e
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for mount selftests.
+CFLAGS = -g -I../../../../usr/include/ -Wall -O2 -pthread
+
+TEST_GEN_FILES += mount_setattr_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/mount_setattr/config b/tools/testing/selftests/mount_setattr/config
new file mode 100644
index 000000000000..416bd53ce982
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/config
@@ -0,0 +1 @@
+CONFIG_USER_NS=y
diff --git a/tools/testing/selftests/mount_setattr/mount_setattr_test.c b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
new file mode 100644
index 000000000000..d940356ab277
--- /dev/null
+++ b/tools/testing/selftests/mount_setattr/mount_setattr_test.c
@@ -0,0 +1,941 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdio.h>
+#include <errno.h>
+#include <pthread.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/wait.h>
+#include <sys/vfs.h>
+#include <sys/statvfs.h>
+#include <sys/sysinfo.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <grp.h>
+#include <stdbool.h>
+#include <stdarg.h>
+
+#include "../kselftest_harness.h"
+
+#ifndef CLONE_NEWNS
+#define CLONE_NEWNS 0x00020000
+#endif
+
+#ifndef CLONE_NEWUSER
+#define CLONE_NEWUSER 0x10000000
+#endif
+
+#ifndef MS_REC
+#define MS_REC 16384
+#endif
+
+#ifndef MS_RELATIME
+#define MS_RELATIME (1 << 21)
+#endif
+
+#ifndef MS_STRICTATIME
+#define MS_STRICTATIME (1 << 24)
+#endif
+
+#ifndef MOUNT_ATTR_RDONLY
+#define MOUNT_ATTR_RDONLY 0x00000001
+#endif
+
+#ifndef MOUNT_ATTR_NOSUID
+#define MOUNT_ATTR_NOSUID 0x00000002
+#endif
+
+#ifndef MOUNT_ATTR_NOEXEC
+#define MOUNT_ATTR_NOEXEC 0x00000008
+#endif
+
+#ifndef MOUNT_ATTR_NODIRATIME
+#define MOUNT_ATTR_NODIRATIME 0x00000080
+#endif
+
+#ifndef MOUNT_ATTR__ATIME
+#define MOUNT_ATTR__ATIME 0x00000070
+#endif
+
+#ifndef MOUNT_ATTR_RELATIME
+#define MOUNT_ATTR_RELATIME 0x00000000
+#endif
+
+#ifndef MOUNT_ATTR_NOATIME
+#define MOUNT_ATTR_NOATIME 0x00000010
+#endif
+
+#ifndef MOUNT_ATTR_STRICTATIME
+#define MOUNT_ATTR_STRICTATIME 0x00000020
+#endif
+
+#ifndef AT_RECURSIVE
+#define AT_RECURSIVE 0x8000
+#endif
+
+#ifndef MAKE_PROPAGATION_SHARED
+#define MAKE_PROPAGATION_SHARED 4
+#endif
+
+#define DEFAULT_THREADS 4
+#define ptr_to_int(p) ((int)((intptr_t)(p)))
+#define int_to_ptr(u) ((void *)((intptr_t)(u)))
+
+#ifndef __NR_mount_setattr
+ #if defined __alpha__
+ #define __NR_mount_setattr 551
+ #elif defined _MIPS_SIM
+ #if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */
+ #define __NR_mount_setattr 4441
+ #endif
+ #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */
+ #define __NR_mount_setattr 6441
+ #endif
+ #if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */
+ #define __NR_mount_setattr 5441
+ #endif
+ #elif defined __ia64__
+ #define __NR_mount_setattr (441 + 1024)
+ #else
+ #define __NR_mount_setattr 441
+ #endif
+
+struct mount_attr {
+ __u64 attr_set;
+ __u64 attr_clr;
+ __u64 propagation;
+};
+#endif
+
+static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flags,
+ struct mount_attr *attr, size_t size)
+{
+ return syscall(__NR_mount_setattr, dfd, path, flags, attr, size);
+}
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+ ssize_t ret;
+
+ do {
+ ret = write(fd, buf, count);
+ } while (ret < 0 && errno == EINTR);
+
+ return ret;
+}
+
+static int write_file(const char *path, const void *buf, size_t count)
+{
+ int fd;
+ ssize_t ret;
+
+ fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY | O_NOFOLLOW);
+ if (fd < 0)
+ return -1;
+
+ ret = write_nointr(fd, buf, count);
+ close(fd);
+ if (ret < 0 || (size_t)ret != count)
+ return -1;
+
+ return 0;
+}
+
+static int create_and_enter_userns(void)
+{
+ uid_t uid;
+ gid_t gid;
+ char map[100];
+
+ uid = getuid();
+ gid = getgid();
+
+ if (unshare(CLONE_NEWUSER))
+ return -1;
+
+ if (write_file("/proc/self/setgroups", "deny", sizeof("deny") - 1) &&
+ errno != ENOENT)
+ return -1;
+
+ snprintf(map, sizeof(map), "0 %d 1", uid);
+ if (write_file("/proc/self/uid_map", map, strlen(map)))
+ return -1;
+
+
+ snprintf(map, sizeof(map), "0 %d 1", gid);
+ if (write_file("/proc/self/gid_map", map, strlen(map)))
+ return -1;
+
+ if (setgid(0))
+ return -1;
+
+ if (setuid(0))
+ return -1;
+
+ return 0;
+}
+
+static int prepare_unpriv_mountns(void)
+{
+ if (create_and_enter_userns())
+ return -1;
+
+ if (unshare(CLONE_NEWNS))
+ return -1;
+
+ if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, 0))
+ return -1;
+
+ return 0;
+}
+
+static int read_mnt_flags(const char *path)
+{
+ int ret;
+ struct statvfs stat;
+ unsigned int mnt_flags;
+
+ ret = statvfs(path, &stat);
+ if (ret != 0)
+ return -EINVAL;
+
+ if (stat.f_flag &
+ ~(ST_RDONLY | ST_NOSUID | ST_NODEV | ST_NOEXEC | ST_NOATIME |
+ ST_NODIRATIME | ST_RELATIME | ST_SYNCHRONOUS | ST_MANDLOCK))
+ return -EINVAL;
+
+ mnt_flags = 0;
+ if (stat.f_flag & ST_RDONLY)
+ mnt_flags |= MS_RDONLY;
+ if (stat.f_flag & ST_NOSUID)
+ mnt_flags |= MS_NOSUID;
+ if (stat.f_flag & ST_NODEV)
+ mnt_flags |= MS_NODEV;
+ if (stat.f_flag & ST_NOEXEC)
+ mnt_flags |= MS_NOEXEC;
+ if (stat.f_flag & ST_NOATIME)
+ mnt_flags |= MS_NOATIME;
+ if (stat.f_flag & ST_NODIRATIME)
+ mnt_flags |= MS_NODIRATIME;
+ if (stat.f_flag & ST_RELATIME)
+ mnt_flags |= MS_RELATIME;
+ if (stat.f_flag & ST_SYNCHRONOUS)
+ mnt_flags |= MS_SYNCHRONOUS;
+ if (stat.f_flag & ST_MANDLOCK)
+ mnt_flags |= ST_MANDLOCK;
+
+ return mnt_flags;
+}
+
+static char *get_field(char *src, int nfields)
+{
+ int i;
+ char *p = src;
+
+ for (i = 0; i < nfields; i++) {
+ while (*p && *p != ' ' && *p != '\t')
+ p++;
+
+ if (!*p)
+ break;
+
+ p++;
+ }
+
+ return p;
+}
+
+static void null_endofword(char *word)
+{
+ while (*word && *word != ' ' && *word != '\t')
+ word++;
+ *word = '\0';
+}
+
+static bool is_shared_mount(const char *path)
+{
+ size_t len = 0;
+ char *line = NULL;
+ FILE *f = NULL;
+
+ f = fopen("/proc/self/mountinfo", "re");
+ if (!f)
+ return false;
+
+ while (getline(&line, &len, f) != -1) {
+ char *opts, *target;
+
+ target = get_field(line, 4);
+ if (!target)
+ continue;
+
+ opts = get_field(target, 2);
+ if (!opts)
+ continue;
+
+ null_endofword(target);
+
+ if (strcmp(target, path) != 0)
+ continue;
+
+ null_endofword(opts);
+ if (strstr(opts, "shared:"))
+ return true;
+ }
+
+ free(line);
+ fclose(f);
+
+ return false;
+}
+
+static void *mount_setattr_thread(void *data)
+{
+ struct mount_attr attr = {
+ .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
+ .attr_clr = 0,
+ .propagation = MAKE_PROPAGATION_SHARED,
+ };
+
+ if (sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)))
+ pthread_exit(int_to_ptr(-1));
+
+ pthread_exit(int_to_ptr(0));
+}
+
+/* Attempt to de-conflict with the selftests tree. */
+#ifndef SKIP
+#define SKIP(s, ...) XFAIL(s, ##__VA_ARGS__)
+#endif
+
+static bool mount_setattr_supported(void)
+{
+ int ret;
+
+ ret = sys_mount_setattr(-EBADF, "", AT_EMPTY_PATH, NULL, 0);
+ if (ret < 0 && errno == ENOSYS)
+ return false;
+
+ return true;
+}
+
+FIXTURE(mount_setattr) {
+};
+
+FIXTURE_SETUP(mount_setattr)
+{
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ ASSERT_EQ(prepare_unpriv_mountns(), 0);
+
+ (void)umount2("/mnt", MNT_DETACH);
+ (void)umount2("/tmp", MNT_DETACH);
+
+ ASSERT_EQ(mount("testing", "/tmp", "tmpfs", MS_NOATIME | MS_NODEV,
+ "size=100000,mode=700"), 0);
+
+ ASSERT_EQ(mkdir("/tmp/B", 0777), 0);
+
+ ASSERT_EQ(mount("testing", "/tmp/B", "tmpfs", MS_NOATIME | MS_NODEV,
+ "size=100000,mode=700"), 0);
+
+ ASSERT_EQ(mkdir("/tmp/B/BB", 0777), 0);
+
+ ASSERT_EQ(mount("testing", "/tmp/B/BB", "tmpfs", MS_NOATIME | MS_NODEV,
+ "size=100000,mode=700"), 0);
+
+ ASSERT_EQ(mount("testing", "/mnt", "tmpfs", MS_NOATIME | MS_NODEV,
+ "size=100000,mode=700"), 0);
+
+ ASSERT_EQ(mkdir("/mnt/A", 0777), 0);
+
+ ASSERT_EQ(mount("testing", "/mnt/A", "tmpfs", MS_NOATIME | MS_NODEV,
+ "size=100000,mode=700"), 0);
+
+ ASSERT_EQ(mkdir("/mnt/A/AA", 0777), 0);
+
+ ASSERT_EQ(mount("/tmp", "/mnt/A/AA", NULL, MS_BIND | MS_REC, NULL), 0);
+
+ ASSERT_EQ(mkdir("/mnt/B", 0777), 0);
+
+ ASSERT_EQ(mount("testing", "/mnt/B", "ramfs",
+ MS_NOATIME | MS_NODEV | MS_NOSUID, 0), 0);
+
+ ASSERT_EQ(mkdir("/mnt/B/BB", 0777), 0);
+
+ ASSERT_EQ(mount("testing", "/tmp/B/BB", "devpts",
+ MS_RELATIME | MS_NOEXEC | MS_RDONLY, 0), 0);
+}
+
+FIXTURE_TEARDOWN(mount_setattr)
+{
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ (void)umount2("/mnt/A", MNT_DETACH);
+ (void)umount2("/tmp", MNT_DETACH);
+}
+
+TEST_F(mount_setattr, invalid_attributes)
+{
+ struct mount_attr invalid_attr = {
+ .attr_set = (1U << 31),
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+ sizeof(invalid_attr)), 0);
+
+ invalid_attr.attr_set = 0;
+ invalid_attr.attr_clr = (1U << 31);
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+ sizeof(invalid_attr)), 0);
+
+ invalid_attr.attr_clr = 0;
+ invalid_attr.propagation = (1U << 31);
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+ sizeof(invalid_attr)), 0);
+
+ invalid_attr.attr_set = (1U << 31);
+ invalid_attr.attr_clr = (1U << 31);
+ invalid_attr.propagation = (1U << 31);
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+ sizeof(invalid_attr)), 0);
+
+ ASSERT_NE(sys_mount_setattr(-1, "mnt/A", AT_RECURSIVE, &invalid_attr,
+ sizeof(invalid_attr)), 0);
+}
+
+TEST_F(mount_setattr, extensibility)
+{
+ unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+ char *s = "dummy";
+ struct mount_attr invalid_attr = {};
+ struct mount_attr_large {
+ struct mount_attr attr1;
+ struct mount_attr attr2;
+ struct mount_attr attr3;
+ } large_attr = {};
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ old_flags = read_mnt_flags("/mnt/A");
+ ASSERT_GT(old_flags, 0);
+
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, NULL,
+ sizeof(invalid_attr)), 0);
+ ASSERT_EQ(errno, EFAULT);
+
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, (void *)s,
+ sizeof(invalid_attr)), 0);
+ ASSERT_EQ(errno, EINVAL);
+
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr, 0), 0);
+ ASSERT_EQ(errno, EINVAL);
+
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+ sizeof(invalid_attr) / 2), 0);
+ ASSERT_EQ(errno, EINVAL);
+
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &invalid_attr,
+ sizeof(invalid_attr) / 2), 0);
+ ASSERT_EQ(errno, EINVAL);
+
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE,
+ (void *)&large_attr, sizeof(large_attr)), 0);
+
+ large_attr.attr3.attr_set = MOUNT_ATTR_RDONLY;
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE,
+ (void *)&large_attr, sizeof(large_attr)), 0);
+
+ large_attr.attr3.attr_set = 0;
+ large_attr.attr1.attr_set = MOUNT_ATTR_RDONLY;
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE,
+ (void *)&large_attr, sizeof(large_attr)), 0);
+
+ expected_flags = old_flags;
+ expected_flags |= MS_RDONLY;
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, basic)
+{
+ unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+ struct mount_attr attr = {
+ .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC | MOUNT_ATTR_RELATIME,
+ .attr_clr = MOUNT_ATTR__ATIME,
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ old_flags = read_mnt_flags("/mnt/A");
+ ASSERT_GT(old_flags, 0);
+
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", 0, &attr, sizeof(attr)), 0);
+
+ expected_flags = old_flags;
+ expected_flags |= MS_RDONLY;
+ expected_flags |= MS_NOEXEC;
+ expected_flags &= ~MS_NOATIME;
+ expected_flags |= MS_RELATIME;
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, old_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, old_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, old_flags);
+}
+
+TEST_F(mount_setattr, basic_recursive)
+{
+ int fd;
+ unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+ struct mount_attr attr = {
+ .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC | MOUNT_ATTR_RELATIME,
+ .attr_clr = MOUNT_ATTR__ATIME,
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ old_flags = read_mnt_flags("/mnt/A");
+ ASSERT_GT(old_flags, 0);
+
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags = old_flags;
+ expected_flags |= MS_RDONLY;
+ expected_flags |= MS_NOEXEC;
+ expected_flags &= ~MS_NOATIME;
+ expected_flags |= MS_RELATIME;
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ memset(&attr, 0, sizeof(attr));
+ attr.attr_clr = MOUNT_ATTR_RDONLY;
+ attr.propagation = MAKE_PROPAGATION_SHARED;
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags &= ~MS_RDONLY;
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+
+ fd = open("/mnt/A/AA/B/b", O_RDWR | O_CLOEXEC | O_CREAT | O_EXCL, 0777);
+ ASSERT_GE(fd, 0);
+
+ /*
+ * We're holding a fd open for writing so this needs to fail somewhere
+ * in the middle and the mount options need to be unchanged.
+ */
+ attr.attr_set = MOUNT_ATTR_RDONLY;
+ ASSERT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+
+ EXPECT_EQ(close(fd), 0);
+}
+
+TEST_F(mount_setattr, mount_has_writers)
+{
+ int fd, dfd;
+ unsigned int old_flags = 0, new_flags = 0;
+ struct mount_attr attr = {
+ .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOEXEC | MOUNT_ATTR_RELATIME,
+ .attr_clr = MOUNT_ATTR__ATIME,
+ .propagation = MAKE_PROPAGATION_SHARED,
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ old_flags = read_mnt_flags("/mnt/A");
+ ASSERT_GT(old_flags, 0);
+
+ fd = open("/mnt/A/AA/B/b", O_RDWR | O_CLOEXEC | O_CREAT | O_EXCL, 0777);
+ ASSERT_GE(fd, 0);
+
+ /*
+ * We're holding a fd open to a mount somwhere in the middle so this
+ * needs to fail somewhere in the middle. After this the mount options
+ * need to be unchanged.
+ */
+ ASSERT_LT(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, old_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A"), false);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, old_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA"), false);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, old_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), false);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, old_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), false);
+
+ dfd = open("/mnt/A/AA/B", O_DIRECTORY | O_CLOEXEC);
+ ASSERT_GE(dfd, 0);
+ EXPECT_EQ(fsync(dfd), 0);
+ EXPECT_EQ(close(dfd), 0);
+
+ EXPECT_EQ(fsync(fd), 0);
+ EXPECT_EQ(close(fd), 0);
+
+ /* All writers are gone so this should succeed. */
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+}
+
+TEST_F(mount_setattr, mixed_mount_options)
+{
+ unsigned int old_flags1 = 0, old_flags2 = 0, new_flags = 0, expected_flags = 0;
+ struct mount_attr attr = {
+ .attr_clr = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NOEXEC | MOUNT_ATTR__ATIME,
+ .attr_set = MOUNT_ATTR_RELATIME,
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ old_flags1 = read_mnt_flags("/mnt/B");
+ ASSERT_GT(old_flags1, 0);
+
+ old_flags2 = read_mnt_flags("/mnt/B/BB");
+ ASSERT_GT(old_flags2, 0);
+
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/B", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags = old_flags2;
+ expected_flags &= ~(MS_RDONLY | MS_NOEXEC | MS_NOATIME | MS_NOSUID);
+ expected_flags |= MS_RELATIME;
+
+ new_flags = read_mnt_flags("/mnt/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ expected_flags = old_flags2;
+ expected_flags &= ~(MS_RDONLY | MS_NOEXEC | MS_NOATIME | MS_NOSUID);
+ expected_flags |= MS_RELATIME;
+
+ new_flags = read_mnt_flags("/mnt/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, time_changes)
+{
+ unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+ struct mount_attr attr = {
+ .attr_set = MOUNT_ATTR_NODIRATIME | MOUNT_ATTR_NOATIME,
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ attr.attr_set = MOUNT_ATTR_STRICTATIME;
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ attr.attr_set = MOUNT_ATTR_STRICTATIME | MOUNT_ATTR_NOATIME;
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ attr.attr_set = MOUNT_ATTR_STRICTATIME | MOUNT_ATTR_NOATIME;
+ attr.attr_clr = MOUNT_ATTR__ATIME;
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ attr.attr_set = 0;
+ attr.attr_clr = MOUNT_ATTR_STRICTATIME;
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ attr.attr_clr = MOUNT_ATTR_NOATIME;
+ ASSERT_NE(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ old_flags = read_mnt_flags("/mnt/A");
+ ASSERT_GT(old_flags, 0);
+
+ attr.attr_set = MOUNT_ATTR_NODIRATIME | MOUNT_ATTR_NOATIME;
+ attr.attr_clr = MOUNT_ATTR__ATIME;
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags = old_flags;
+ expected_flags |= MS_NOATIME;
+ expected_flags |= MS_NODIRATIME;
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ memset(&attr, 0, sizeof(attr));
+ attr.attr_set &= ~MOUNT_ATTR_NOATIME;
+ attr.attr_set |= MOUNT_ATTR_RELATIME;
+ attr.attr_clr |= MOUNT_ATTR__ATIME;
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags &= ~MS_NOATIME;
+ expected_flags |= MS_RELATIME;
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ memset(&attr, 0, sizeof(attr));
+ attr.attr_set &= ~MOUNT_ATTR_RELATIME;
+ attr.attr_set |= MOUNT_ATTR_STRICTATIME;
+ attr.attr_clr |= MOUNT_ATTR__ATIME;
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags &= ~MS_RELATIME;
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ memset(&attr, 0, sizeof(attr));
+ attr.attr_set &= ~MOUNT_ATTR_STRICTATIME;
+ attr.attr_set |= MOUNT_ATTR_NOATIME;
+ attr.attr_clr |= MOUNT_ATTR__ATIME;
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags |= MS_NOATIME;
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ memset(&attr, 0, sizeof(attr));
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ memset(&attr, 0, sizeof(attr));
+ attr.attr_clr = MOUNT_ATTR_NODIRATIME;
+ ASSERT_EQ(sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr)), 0);
+
+ expected_flags &= ~MS_NODIRATIME;
+
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+}
+
+TEST_F(mount_setattr, multi_threaded)
+{
+ int i, j, nthreads, ret = 0;
+ unsigned int old_flags = 0, new_flags = 0, expected_flags = 0;
+ pthread_attr_t pattr;
+ pthread_t threads[DEFAULT_THREADS];
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ old_flags = read_mnt_flags("/mnt/A");
+ ASSERT_GT(old_flags, 0);
+
+ /* Try to change mount options from multiple threads. */
+ nthreads = get_nprocs_conf();
+ if (nthreads > DEFAULT_THREADS)
+ nthreads = DEFAULT_THREADS;
+
+ pthread_attr_init(&pattr);
+ for (i = 0; i < nthreads; i++)
+ ASSERT_EQ(pthread_create(&threads[i], &pattr, mount_setattr_thread, NULL), 0);
+
+ for (j = 0; j < i; j++) {
+ void *retptr = NULL;
+
+ EXPECT_EQ(pthread_join(threads[j], &retptr), 0);
+
+ ret += ptr_to_int(retptr);
+ EXPECT_EQ(ret, 0);
+ }
+ pthread_attr_destroy(&pattr);
+
+ ASSERT_EQ(ret, 0);
+
+ expected_flags = old_flags;
+ expected_flags |= MS_RDONLY;
+ expected_flags |= MS_NOSUID;
+ new_flags = read_mnt_flags("/mnt/A");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B"), true);
+
+ new_flags = read_mnt_flags("/mnt/A/AA/B/BB");
+ ASSERT_EQ(new_flags, expected_flags);
+
+ ASSERT_EQ(is_shared_mount("/mnt/A/AA/B/BB"), true);
+}
+
+TEST_F(mount_setattr, wrong_user_namespace)
+{
+ int ret;
+ struct mount_attr attr = {
+ .attr_set = MOUNT_ATTR_RDONLY,
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ EXPECT_EQ(create_and_enter_userns(), 0);
+ ret = sys_mount_setattr(-1, "/mnt/A", AT_RECURSIVE, &attr, sizeof(attr));
+ ASSERT_LT(ret, 0);
+ ASSERT_EQ(errno, EPERM);
+}
+
+TEST_F(mount_setattr, wrong_mount_namespace)
+{
+ int fd, ret;
+ struct mount_attr attr = {
+ .attr_set = MOUNT_ATTR_RDONLY,
+ };
+
+ if (!mount_setattr_supported())
+ SKIP(return, "mount_setattr syscall not supported");
+
+ fd = open("/mnt/A", O_DIRECTORY | O_CLOEXEC);
+ ASSERT_GE(fd, 0);
+
+ ASSERT_EQ(unshare(CLONE_NEWNS), 0);
+
+ ret = sys_mount_setattr(fd, "", AT_EMPTY_PATH | AT_RECURSIVE, &attr, sizeof(attr));
+ ASSERT_LT(ret, 0);
+ ASSERT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
--
2.29.2

2020-11-28 22:26:12

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v3 07/38] mount: attach mappings to mounts

In order to support per-mount idmappings vfsmounts will be marked with user
namespaces. The idmapping associated with that user namespace will be used to
map the ids of vfs objects when they are accessed through that mount. By
default all vfsmounts are marked with the initial user namespace. The initial
user namespace is used to indicate that a mount is not idmapped. All operations
behave as before.

Based on prior discussions we want to attach the whole user namespace and not
just a dedicated idmapping struct. This allows us to reuse all the helpers that
already exist for dealing with idmappings instead of introducing a whole new
range of helpers. In addition, if we decide in the future that we are confident
enough to enable unprivileged users to setup idmapped mounts we can allow
already idmapped mounts to be marked with another user namespace. The permission
checking would then take into account whether the caller is privileged in the
user namespace the mount is currently marked with. For now, we will enforce in
later patches that once a mount has been idmapped it can't be remapped. This
keeps permission checking and life-cycle management simple, especially since
users can always create a new mount with a different idmapping anyway.

The idea to attach user namespaces to vfsmounts has been floated around in
various forms at Linux Plumbers in ~2018 with the original idea tracing back to
a discussion during a conference in St. Petersburg between Christoph, Tycho, and
myself.

Cc: Christoph Hellwig <[email protected]>
Cc: David Howells <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
patch introduced
- Christoph Hellwig <[email protected]>:
- Split internal implementation into separate patch and move syscall
implementation later.

/* v3 */
- David Howells <[email protected]>:
- Remove MNT_IDMAPPED flag. We can simply check the pointer and use
smp_load_acquire() in later patches.

- Tycho Andersen <[email protected]>:
- Use READ_ONCE() in mnt_user_ns().
---
fs/namespace.c | 9 +++++++++
include/linux/fs.h | 1 +
include/linux/mount.h | 6 ++++++
3 files changed, 16 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index f9ea31b7eb7f..a1fda548c3af 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -220,6 +220,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_HLIST_NODE(&mnt->mnt_mp_list);
INIT_LIST_HEAD(&mnt->mnt_umounting);
INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
+ mnt->mnt.mnt_user_ns = &init_user_ns;
}
return mnt;

@@ -559,6 +560,11 @@ int sb_prepare_remount_readonly(struct super_block *sb)

static void free_vfsmnt(struct mount *mnt)
{
+ struct user_namespace *ns;
+
+ ns = mnt_user_ns(&mnt->mnt);
+ if (ns != &init_user_ns)
+ put_user_ns(ns);
kfree_const(mnt->mnt_devname);
#ifdef CONFIG_SMP
free_percpu(mnt->mnt_pcp);
@@ -1067,6 +1073,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);

atomic_inc(&sb->s_active);
+ mnt->mnt.mnt_user_ns = mnt_user_ns(&old->mnt);
+ if (mnt->mnt.mnt_user_ns != &init_user_ns)
+ mnt->mnt.mnt_user_ns = get_user_ns(mnt->mnt.mnt_user_ns);
mnt->mnt.mnt_sb = sb;
mnt->mnt.mnt_root = dget(root);
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f59b7f16f216..004686f49e32 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2276,6 +2276,7 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */
+#define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */
#define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
int (*init_fs_context)(struct fs_context *);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index aaf343b38671..d0d4a270acea 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -72,8 +72,14 @@ struct vfsmount {
struct dentry *mnt_root; /* root of the mounted tree */
struct super_block *mnt_sb; /* pointer to superblock */
int mnt_flags;
+ struct user_namespace *mnt_user_ns;
} __randomize_layout;

+static inline struct user_namespace *mnt_user_ns(const struct vfsmount *mnt)
+{
+ return READ_ONCE(mnt->mnt_user_ns);
+}
+
struct file; /* forward dec */
struct path;

--
2.29.2

2020-11-28 22:26:35

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v3 15/38] stat: handle idmapped mounts

The generic_fillattr() helper fills in the basic attributes associated with an
inode. Enable it to handle idmapped mounts. If the inode is accessed through an
idmapped mount we need to map it according to the mount's user namespace before
we store the uid and gid.
If the initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Cc: Christoph Hellwig <[email protected]>
Cc: David Howells <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- Christoph Hellwig <[email protected]>:
- Don't pollute the vfs with additional helpers simply extend the existing
helpers with an additional argument and switch all callers.

/* v3 */
unchanged
---
fs/9p/vfs_inode.c | 4 ++--
fs/9p/vfs_inode_dotl.c | 4 ++--
fs/afs/inode.c | 2 +-
fs/btrfs/inode.c | 2 +-
fs/ceph/inode.c | 2 +-
fs/cifs/inode.c | 2 +-
fs/coda/inode.c | 2 +-
fs/ecryptfs/inode.c | 4 ++--
fs/erofs/inode.c | 2 +-
fs/exfat/file.c | 2 +-
fs/ext2/inode.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/f2fs/file.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dir.c | 2 +-
fs/gfs2/inode.c | 2 +-
fs/hfsplus/inode.c | 2 +-
fs/kernfs/inode.c | 2 +-
fs/libfs.c | 4 ++--
fs/minix/inode.c | 2 +-
fs/nfs/inode.c | 2 +-
fs/nfs/namespace.c | 2 +-
fs/ocfs2/file.c | 2 +-
fs/orangefs/inode.c | 2 +-
fs/proc/base.c | 8 ++++----
fs/proc/generic.c | 2 +-
fs/proc/proc_net.c | 2 +-
fs/proc/proc_sysctl.c | 2 +-
fs/proc/root.c | 2 +-
fs/stat.c | 10 ++++++----
fs/sysv/itree.c | 2 +-
fs/ubifs/dir.c | 2 +-
fs/udf/symlink.c | 2 +-
fs/vboxsf/utils.c | 2 +-
include/linux/fs.h | 2 +-
mm/shmem.c | 2 +-
36 files changed, 48 insertions(+), 46 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 404526499c94..0a5c022c1c70 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1006,7 +1006,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
- generic_fillattr(d_inode(dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(dentry), stat);
return 0;
}
fid = v9fs_fid_lookup(dentry);
@@ -1018,7 +1018,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
return PTR_ERR(st);

v9fs_stat2inode(st, d_inode(dentry), dentry->d_sb, 0);
- generic_fillattr(d_inode(dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(dentry), stat);

p9stat_free(st);
kfree(st);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 282ec5cb45dc..8f3c1daf72ba 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -466,7 +466,7 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
- generic_fillattr(d_inode(dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(dentry), stat);
return 0;
}
fid = v9fs_fid_lookup(dentry);
@@ -482,7 +482,7 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
return PTR_ERR(st);

v9fs_stat2inode_dotl(st, d_inode(dentry), 0);
- generic_fillattr(d_inode(dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(dentry), stat);
/* Change block size to what the server returned */
stat->blksize = st->st_blksize;

diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 0fe8844b4bee..17ecdee404eb 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -737,7 +737,7 @@ int afs_getattr(const struct path *path, struct kstat *stat,

do {
read_seqbegin_or_lock(&vnode->cb_lock, &seq);
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
if (test_bit(AFS_VNODE_SILLY_DELETED, &vnode->flags) &&
stat->nlink > 0)
stat->nlink -= 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6f4aed0d311..99b4fd66681d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8783,7 +8783,7 @@ static int btrfs_getattr(const struct path *path, struct kstat *stat,
STATX_ATTR_IMMUTABLE |
STATX_ATTR_NODUMP);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->dev = BTRFS_I(inode)->root->anon_dev;

spin_lock(&BTRFS_I(inode)->lock);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 19ec845ba5ec..decfa5e1ec4c 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2380,7 +2380,7 @@ int ceph_getattr(const struct path *path, struct kstat *stat,
return err;
}

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->ino = ceph_present_inode(inode);

/*
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 4d69b786e403..4fb3b6961096 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2400,7 +2400,7 @@ int cifs_getattr(const struct path *path, struct kstat *stat,
return rc;
}

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->blksize = cifs_sb->bsize;
stat->ino = CIFS_I(inode)->uniqueid;

diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index b1c70e2b9b1e..4d113e191cb8 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -256,7 +256,7 @@ int coda_getattr(const struct path *path, struct kstat *stat,
{
int err = coda_revalidate_inode(d_inode(path->dentry));
if (!err)
- generic_fillattr(d_inode(path->dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(path->dentry), stat);
return err;
}

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index c35c2d2f3fe6..d98448c75051 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -975,7 +975,7 @@ static int ecryptfs_getattr_link(const struct path *path, struct kstat *stat,

mount_crypt_stat = &ecryptfs_superblock_to_private(
dentry->d_sb)->mount_crypt_stat;
- generic_fillattr(d_inode(dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(dentry), stat);
if (mount_crypt_stat->flags & ECRYPTFS_GLOBAL_ENCRYPT_FILENAMES) {
char *target;
size_t targetsiz;
@@ -1003,7 +1003,7 @@ static int ecryptfs_getattr(const struct path *path, struct kstat *stat,
if (!rc) {
fsstack_copy_attr_all(d_inode(dentry),
ecryptfs_inode_to_lower(d_inode(dentry)));
- generic_fillattr(d_inode(dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(dentry), stat);
stat->blocks = lower_stat.blocks;
}
return rc;
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 3e21c0e8adae..083818063ac6 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -343,7 +343,7 @@ int erofs_getattr(const struct path *path, struct kstat *stat,
stat->attributes_mask |= (STATX_ATTR_COMPRESSED |
STATX_ATTR_IMMUTABLE);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
return 0;
}

diff --git a/fs/exfat/file.c b/fs/exfat/file.c
index ace35aa8e64b..e9705b3295d3 100644
--- a/fs/exfat/file.c
+++ b/fs/exfat/file.c
@@ -273,7 +273,7 @@ int exfat_getattr(const struct path *path, struct kstat *stat,
struct inode *inode = d_backing_inode(path->dentry);
struct exfat_inode_info *ei = EXFAT_I(inode);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
exfat_truncate_atime(&stat->atime);
stat->result_mask |= STATX_BTIME;
stat->btime.tv_sec = ei->i_crtime.tv_sec;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5ff1e5e3c0fe..7fcda76094ff 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1659,7 +1659,7 @@ int ext2_getattr(const struct path *path, struct kstat *stat,
STATX_ATTR_IMMUTABLE |
STATX_ATTR_NODUMP);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
return 0;
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f0487795afd1..d992fef2e7c2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5558,7 +5558,7 @@ int ext4_getattr(const struct path *path, struct kstat *stat,
STATX_ATTR_NODUMP |
STATX_ATTR_VERITY);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
return 0;
}

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 3633a558ddcc..34f6522f5828 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -814,7 +814,7 @@ int f2fs_getattr(const struct path *path, struct kstat *stat,
STATX_ATTR_NODUMP |
STATX_ATTR_VERITY);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);

/* we need to show initial sectors used for inline_data/dentries */
if ((S_ISREG(inode->i_mode) && f2fs_has_inline_data(inode)) ||
diff --git a/fs/fat/file.c b/fs/fat/file.c
index 805b501467e9..f7e04f533d31 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -398,7 +398,7 @@ int fat_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int flags)
{
struct inode *inode = d_inode(path->dentry);
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->blksize = MSDOS_SB(inode->i_sb)->cluster_size;

if (MSDOS_SB(inode->i_sb)->options.nfs == FAT_NFS_NOSTALE_RO) {
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index dc6e8cc565d2..0750755685cd 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1064,7 +1064,7 @@ static int fuse_update_get_attr(struct inode *inode, struct file *file,
forget_all_cached_acls(inode);
err = fuse_do_getattr(inode, stat, file);
} else if (stat) {
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->mode = fi->orig_i_mode;
stat->ino = fi->orig_ino;
}
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index a29b66116c01..f1841667bcd9 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -2048,7 +2048,7 @@ static int gfs2_getattr(const struct path *path, struct kstat *stat,
STATX_ATTR_IMMUTABLE |
STATX_ATTR_NODUMP);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);

if (gfs2_holder_initialized(&gh))
gfs2_glock_dq_uninit(&gh);
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index a2789730f451..a34f37a53dcd 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -286,7 +286,7 @@ int hfsplus_getattr(const struct path *path, struct kstat *stat,
stat->attributes_mask |= STATX_ATTR_APPEND | STATX_ATTR_IMMUTABLE |
STATX_ATTR_NODUMP;

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
return 0;
}

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 74ee0ffcfe71..062593491bbe 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -193,7 +193,7 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
kernfs_refresh_inode(kn, inode);
mutex_unlock(&kernfs_mutex);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
return 0;
}

diff --git a/fs/libfs.c b/fs/libfs.c
index f50446576b10..6aa8bead838f 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -31,7 +31,7 @@ int simple_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
struct inode *inode = d_inode(path->dentry);
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->blocks = inode->i_mapping->nrpages << (PAGE_SHIFT - 9);
return 0;
}
@@ -1302,7 +1302,7 @@ static int empty_dir_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
struct inode *inode = d_inode(path->dentry);
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
return 0;
}

diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index 34f546404aa1..91c81d2fc90d 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -658,7 +658,7 @@ int minix_getattr(const struct path *path, struct kstat *stat,
struct super_block *sb = path->dentry->d_sb;
struct inode *inode = d_inode(path->dentry);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
if (INODE_VERSION(inode) == MINIX_V1)
stat->blocks = (BLOCK_SIZE / 512) * V1_minix_blocks(stat->size, sb);
else
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index aa6493905bbe..8d74974d7992 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -858,7 +858,7 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
/* Only return attributes that were revalidated. */
stat->result_mask &= request_mask;
out_no_update:
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
if (S_ISDIR(inode->i_mode))
stat->blksize = NFS_SERVER(inode)->dtsize;
diff --git a/fs/nfs/namespace.c b/fs/nfs/namespace.c
index 2bcbe38afe2e..55fc711e368b 100644
--- a/fs/nfs/namespace.c
+++ b/fs/nfs/namespace.c
@@ -213,7 +213,7 @@ nfs_namespace_getattr(const struct path *path, struct kstat *stat,
{
if (NFS_FH(d_inode(path->dentry))->size != 0)
return nfs_getattr(path, stat, request_mask, query_flags);
- generic_fillattr(d_inode(path->dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(path->dentry), stat);
return 0;
}

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index cabf355b148f..a070d4c9b6ed 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1313,7 +1313,7 @@ int ocfs2_getattr(const struct path *path, struct kstat *stat,
goto bail;
}

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
/*
* If there is inline data in the inode, the inode will normally not
* have data blocks allocated (it may have an external xattr block).
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 563fe9ab8eb2..b94032f77e61 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -903,7 +903,7 @@ int orangefs_getattr(const struct path *path, struct kstat *stat,
ret = orangefs_inode_getattr(inode,
request_mask & STATX_SIZE ? ORANGEFS_GETATTR_SIZE : 0);
if (ret == 0) {
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);

/* override block size reported to stat */
if (!(request_mask & STATX_SIZE))
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 266b54b75c4c..c711d6fb3051 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1934,7 +1934,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
struct task_struct *task;

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);

stat->uid = GLOBAL_ROOT_UID;
stat->gid = GLOBAL_ROOT_GID;
@@ -2620,7 +2620,7 @@ static struct dentry *proc_pident_instantiate(struct dentry *dentry,
return d_splice_alias(inode, dentry);
}

-static struct dentry *proc_pident_lookup(struct inode *dir,
+static struct dentry *proc_pident_lookup(struct inode *dir,
struct dentry *dentry,
const struct pid_entry *p,
const struct pid_entry *end)
@@ -2820,7 +2820,7 @@ static const struct pid_entry attr_dir_stuff[] = {

static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
{
- return proc_pident_readdir(file, ctx,
+ return proc_pident_readdir(file, ctx,
attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
}

@@ -3797,7 +3797,7 @@ static int proc_task_getattr(const struct path *path, struct kstat *stat,
{
struct inode *inode = d_inode(path->dentry);
struct task_struct *p = get_proc_task(inode);
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);

if (p) {
stat->nlink += get_nr_threads(p);
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 896f2ebde75c..5c6345c68fc5 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -145,7 +145,7 @@ static int proc_getattr(const struct path *path, struct kstat *stat,
}
}

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
return 0;
}

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index ed8a6306990c..bc555b7da6dc 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -313,7 +313,7 @@ static int proc_tgid_net_getattr(const struct path *path, struct kstat *stat,

net = get_proc_task_net(inode);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);

if (net != NULL) {
stat->nlink = net->proc_net->nlink;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index ec67dbc1f705..87c828348140 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -840,7 +840,7 @@ static int proc_sys_getattr(const struct path *path, struct kstat *stat,
if (IS_ERR(head))
return PTR_ERR(head);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
if (table)
stat->mode = (stat->mode & S_IFMT) | table->mode;

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..244e4b6f15ef 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -311,7 +311,7 @@ void __init proc_root_init(void)
static int proc_root_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
- generic_fillattr(d_inode(path->dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(path->dentry), stat);
stat->nlink = proc_root.nlink + nr_processes();
return 0;
}
diff --git a/fs/stat.c b/fs/stat.c
index dacecdda2e79..868e39e101bc 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -26,6 +26,7 @@

/**
* generic_fillattr - Fill in the basic attributes from the inode struct
+ * @user_ns: the user namespace from which we access this inode
* @inode: Inode to use as the source
* @stat: Where to fill in the attributes
*
@@ -33,14 +34,15 @@
* found on the VFS inode structure. This is the default if no getattr inode
* operation is supplied.
*/
-void generic_fillattr(struct inode *inode, struct kstat *stat)
+void generic_fillattr(struct user_namespace *mnt_user_ns, struct inode *inode,
+ struct kstat *stat)
{
stat->dev = inode->i_sb->s_dev;
stat->ino = inode->i_ino;
stat->mode = inode->i_mode;
stat->nlink = inode->i_nlink;
- stat->uid = inode->i_uid;
- stat->gid = inode->i_gid;
+ stat->uid = i_uid_into_mnt(mnt_user_ns, inode);
+ stat->gid = i_gid_into_mnt(mnt_user_ns, inode);
stat->rdev = inode->i_rdev;
stat->size = i_size_read(inode);
stat->atime = inode->i_atime;
@@ -87,7 +89,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
return inode->i_op->getattr(path, stat, request_mask,
query_flags);

- generic_fillattr(inode, stat);
+ generic_fillattr(mnt_user_ns(path->mnt), inode, stat);
return 0;
}
EXPORT_SYMBOL(vfs_getattr_nosec);
diff --git a/fs/sysv/itree.c b/fs/sysv/itree.c
index bcb67b0cabe7..83cffab6955f 100644
--- a/fs/sysv/itree.c
+++ b/fs/sysv/itree.c
@@ -445,7 +445,7 @@ int sysv_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int flags)
{
struct super_block *s = path->dentry->d_sb;
- generic_fillattr(d_inode(path->dentry), stat);
+ generic_fillattr(&init_user_ns, d_inode(path->dentry), stat);
stat->blocks = (s->s_blocksize / 512) * sysv_nblocks(s, stat->size);
stat->blksize = s->s_blocksize;
return 0;
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 1639331f9543..144422c39125 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1579,7 +1579,7 @@ int ubifs_getattr(const struct path *path, struct kstat *stat,
STATX_ATTR_ENCRYPTED |
STATX_ATTR_IMMUTABLE);

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
stat->blksize = UBIFS_BLOCK_SIZE;
stat->size = ui->ui_size;

diff --git a/fs/udf/symlink.c b/fs/udf/symlink.c
index c973db239604..54a44d1f023c 100644
--- a/fs/udf/symlink.c
+++ b/fs/udf/symlink.c
@@ -159,7 +159,7 @@ static int udf_symlink_getattr(const struct path *path, struct kstat *stat,
struct inode *inode = d_backing_inode(dentry);
struct page *page;

- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);
page = read_mapping_page(inode->i_mapping, 0, NULL);
if (IS_ERR(page))
return PTR_ERR(page);
diff --git a/fs/vboxsf/utils.c b/fs/vboxsf/utils.c
index 018057546067..d2cd1c99f48e 100644
--- a/fs/vboxsf/utils.c
+++ b/fs/vboxsf/utils.c
@@ -233,7 +233,7 @@ int vboxsf_getattr(const struct path *path, struct kstat *kstat,
if (err)
return err;

- generic_fillattr(d_inode(dentry), kstat);
+ generic_fillattr(&init_user_ns, d_inode(dentry), kstat);
return 0;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ef6917e1ec75..696ad3e0df03 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3139,7 +3139,7 @@ extern int __page_symlink(struct inode *inode, const char *symname, int len,
extern int page_symlink(struct inode *inode, const char *symname, int len);
extern const struct inode_operations page_symlink_inode_operations;
extern void kfree_link(void *);
-extern void generic_fillattr(struct inode *, struct kstat *);
+extern void generic_fillattr(struct user_namespace *, struct inode *, struct kstat *);
extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
void __inode_add_bytes(struct inode *inode, loff_t bytes);
diff --git a/mm/shmem.c b/mm/shmem.c
index 8fdf52576e06..7c317b0d06f0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1072,7 +1072,7 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
shmem_recalc_inode(inode);
spin_unlock_irq(&info->lock);
}
- generic_fillattr(inode, stat);
+ generic_fillattr(&init_user_ns, inode, stat);

if (is_huge_enabled(sb_info))
stat->blksize = HPAGE_PMD_SIZE;
--
2.29.2

2020-11-28 22:57:45

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v3 00/38] idmapped mounts

On Sat, Nov 28, 2020 at 10:34:49PM +0100, Christian Brauner wrote:
> Hey everyone,

Hey Christian,

a general request. Argue with me if it seems misguided.

When looking at a patch or a small hunk of code, these days, if a variable
called 'ns' or 'user_ns' is seen passed to a function, it can be easy to
assume which user_ns it is based on what you think would make sense, but if
your assumption is wrong, your patch review will be wrong.

Can we stick to a convention where we have maybe

subj_userns - the userns of the task seeking some action
obj_userns - the userns of the thing being acted on - task, superblock,...
mnt_userns - the userns of a mountpoint through which an object is seen

You're replacing a lot of such callers and callsites in this patchset, so
this would be a great time to start doing that.

-serge

2020-12-01 10:23:24

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v3 00/38] idmapped mounts

On Sat, Nov 28, 2020 at 04:54:50PM -0600, Serge Hallyn wrote:
> On Sat, Nov 28, 2020 at 10:34:49PM +0100, Christian Brauner wrote:
> > Hey everyone,
>
> Hey Christian,
>
> a general request. Argue with me if it seems misguided.
>
> When looking at a patch or a small hunk of code, these days, if a variable
> called 'ns' or 'user_ns' is seen passed to a function, it can be easy to
> assume which user_ns it is based on what you think would make sense, but if
> your assumption is wrong, your patch review will be wrong.
>
> Can we stick to a convention where we have maybe
>
> subj_userns - the userns of the task seeking some action
> obj_userns - the userns of the thing being acted on - task, superblock,...
> mnt_userns - the userns of a mountpoint through which an object is seen
>
> You're replacing a lot of such callers and callsites in this patchset, so
> this would be a great time to start doing that.

Hey Serge,

this makes a lot of sense. I'll convert all accesses to the vfsmount's
userns we're introducing in this series to to mnt_userns at least.

Christian

2020-12-01 10:52:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 07/38] mount: attach mappings to mounts

The READ_ONCE still looks suspect as it generally needs to be paired
with a WRITE_ONCE. The rest looks sane to me.

2020-12-01 10:52:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 04/38] fs: add mount_setattr()

Lots of crazy long lines in the patch. Remember that you should only
go past 80 lines if it clearly improves readability, and I don't
think it does anywhere in here.

> index a7cd0f64faa4..a5a6c470dc07 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -82,6 +82,14 @@ int may_linkat(struct path *link);
> /*
> * namespace.c
> */
> +struct mount_kattr {
> + unsigned int attr_set;
> + unsigned int attr_clr;
> + unsigned int propagation;
> + unsigned int lookup_flags;
> + bool recurse;
> +};

Even with the whole series applied this structure is only used in
namespace.c, so it might be worth moving there.

> +static inline int mnt_hold_writers(struct mount *mnt)
> {
> - int ret = 0;
> -
> mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
> /*
> * After storing MNT_WRITE_HOLD, we'll read the counters. This store
> @@ -497,15 +495,29 @@ static int mnt_make_readonly(struct mount *mnt)
> * we're counting up here.
> */
> if (mnt_get_writers(mnt) > 0)
> - ret = -EBUSY;
> - else
> - mnt->mnt.mnt_flags |= MNT_READONLY;
> + return -EBUSY;
> +
> + return 0;
> +}
> +
> +static inline void mnt_unhold_writers(struct mount *mnt)
> +{
> /*
> * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
> * that become unheld will see MNT_READONLY.
> */
> smp_wmb();
> mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
> +}
> +
> +static int mnt_make_readonly(struct mount *mnt)
> +{
> + int ret;
> +
> + ret = mnt_hold_writers(mnt);
> + if (!ret)
> + mnt->mnt.mnt_flags |= MNT_READONLY;
> + mnt_unhold_writers(mnt);
> return ret;
> }
>
> @@ -3438,6 +3450,33 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
> return ret;
> }

This refactoring seems worth a little prep patch.

>
> +static int build_attr_flags(unsigned int attr_flags, unsigned int *flags)
> +{
> + unsigned int aflags = 0;
> +
> + if (attr_flags & ~(MOUNT_ATTR_RDONLY |
> + MOUNT_ATTR_NOSUID |
> + MOUNT_ATTR_NODEV |
> + MOUNT_ATTR_NOEXEC |
> + MOUNT_ATTR__ATIME |
> + MOUNT_ATTR_NODIRATIME))
> + return -EINVAL;
> +
> + if (attr_flags & MOUNT_ATTR_RDONLY)
> + aflags |= MNT_READONLY;
> + if (attr_flags & MOUNT_ATTR_NOSUID)
> + aflags |= MNT_NOSUID;
> + if (attr_flags & MOUNT_ATTR_NODEV)
> + aflags |= MNT_NODEV;
> + if (attr_flags & MOUNT_ATTR_NOEXEC)
> + aflags |= MNT_NOEXEC;
> + if (attr_flags & MOUNT_ATTR_NODIRATIME)
> + aflags |= MNT_NODIRATIME;
> +
> + *flags = aflags;
> + return 0;
> +}

Same for adding this helper.

> + *kattr = (struct mount_kattr){

Missing whitespace before the {.

> + switch (attr->propagation) {
> + case MAKE_PROPAGATION_UNCHANGED:
> + kattr->propagation = 0;
> + break;
> + case MAKE_PROPAGATION_UNBINDABLE:
> + kattr->propagation = MS_UNBINDABLE;
> + break;
> + case MAKE_PROPAGATION_PRIVATE:
> + kattr->propagation = MS_PRIVATE;
> + break;
> + case MAKE_PROPAGATION_DEPENDENT:
> + kattr->propagation = MS_SLAVE;
> + break;
> + case MAKE_PROPAGATION_SHARED:
> + kattr->propagation = MS_SHARED;
> + break;
> + default:

Any reason to not just reuse the MS_* flags in the new API? Yes, your
new names are more descriptive, but having different names for the same
thing is also rather confusing.

> + if (upper_32_bits(attr->attr_set))
> + return -EINVAL;
> + if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
> + return -EINVAL;
> +
> + if (upper_32_bits(attr->attr_clr))
> + return -EINVAL;
> + if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
> + return -EINVAL;

What is so magic about the upper and lower 32 bits?

> + return -EINVAL;
> + else if ((attr->attr_clr & MOUNT_ATTR__ATIME) &&
> + ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME))
> + return -EINVAL;

No need for the else here.

That being said I'd reword the thing to be a little more obvious:

if (attr->attr_clr & MOUNT_ATTR__ATIME) {
if ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME)
return -EINVAL;

... code doing the update of the atime flags here
} else {
if (attr->attr_set & MOUNT_ATTR__ATIME)
return -EINVAL;
}


> +/* Change propagation through mount_setattr(). */
> +enum propagation_type {
> + MAKE_PROPAGATION_UNCHANGED = 0, /* Don't change mount propagation (default). */
> + MAKE_PROPAGATION_UNBINDABLE = 1, /* Make unbindable. */
> + MAKE_PROPAGATION_PRIVATE = 2, /* Do not receive or send mount events. */
> + MAKE_PROPAGATION_DEPENDENT = 3, /* Only receive mount events. */
> + MAKE_PROPAGATION_SHARED = 4, /* Send and receive mount events. */
> +};

FYI, in uapis using defines instead of enums is usually the better
choice, as that allows userspace to probe for later added defines.

But if we use MS_* here that would be void anyway.

> +/* List of all mount_attr versions. */
> +#define MOUNT_ATTR_SIZE_VER0 24 /* sizeof first published struct */
> +#define MOUNT_ATTR_SIZE_LATEST MOUNT_ATTR_SIZE_VER0

The _LATEST things is pretty dangerous as there basically is no safe
and correct way for userspace to use it.

2020-12-01 13:27:45

by Tycho Andersen

[permalink] [raw]
Subject: Re: [PATCH v3 07/38] mount: attach mappings to mounts

On Tue, Dec 01, 2020 at 11:50:25AM +0100, Christoph Hellwig wrote:
> The READ_ONCE still looks suspect as it generally needs to be paired
> with a WRITE_ONCE. The rest looks sane to me.

Yeah, the comment from the other location is,

/* Pairs with smp_load_acquire() in mnt_user_ns(). */

So I think it's just a typo that needs to be fixed.

Tycho

2020-12-02 09:26:15

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v3 07/38] mount: attach mappings to mounts

On Tue, Dec 01, 2020 at 11:50:25AM +0100, Christoph Hellwig wrote:
> The READ_ONCE still looks suspect as it generally needs to be paired
> with a WRITE_ONCE. The rest looks sane to me.

I think the READ_ONCE() can be dropped from this patch. At this point in
the series we don't allowing changing the vfsmount's userns. The infra
to do that is only introduced as almost the last patch in the series and
there we immediately use smp_load_acquire() and smp_store_release().

Thanks!
Christian

2020-12-02 09:43:44

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v3 04/38] fs: add mount_setattr()

On Tue, Dec 01, 2020 at 11:49:07AM +0100, Christoph Hellwig wrote:

Sorry for not responding to this yesterday. I missed most of your mails
because they have been filtered into a dedicated folder (as they should
be) and I would've looked into that folder but somehow gmail let ~3
mails of you into my general inbox and so I didn't bother...

> Lots of crazy long lines in the patch. Remember that you should only
> go past 80 lines if it clearly improves readability, and I don't
> think it does anywhere in here.

Weird, I did reformat the patch to the 80 char limit and I have dual
display in vim, meaning I have a visible line at 80 chars and 100 chars
whenever I edit a file. I'll go through it again, thanks!


>
> > index a7cd0f64faa4..a5a6c470dc07 100644
> > --- a/fs/internal.h
> > +++ b/fs/internal.h
> > @@ -82,6 +82,14 @@ int may_linkat(struct path *link);
> > /*
> > * namespace.c
> > */
> > +struct mount_kattr {
> > + unsigned int attr_set;
> > + unsigned int attr_clr;
> > + unsigned int propagation;
> > + unsigned int lookup_flags;
> > + bool recurse;
> > +};
>
> Even with the whole series applied this structure is only used in
> namespace.c, so it might be worth moving there.

Good point. Will do.

>
> > +static inline int mnt_hold_writers(struct mount *mnt)
> > {
> > - int ret = 0;
> > -
> > mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
> > /*
> > * After storing MNT_WRITE_HOLD, we'll read the counters. This store
> > @@ -497,15 +495,29 @@ static int mnt_make_readonly(struct mount *mnt)
> > * we're counting up here.
> > */
> > if (mnt_get_writers(mnt) > 0)
> > - ret = -EBUSY;
> > - else
> > - mnt->mnt.mnt_flags |= MNT_READONLY;
> > + return -EBUSY;
> > +
> > + return 0;
> > +}
> > +
> > +static inline void mnt_unhold_writers(struct mount *mnt)
> > +{
> > /*
> > * MNT_READONLY must become visible before ~MNT_WRITE_HOLD, so writers
> > * that become unheld will see MNT_READONLY.
> > */
> > smp_wmb();
> > mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
> > +}
> > +
> > +static int mnt_make_readonly(struct mount *mnt)
> > +{
> > + int ret;
> > +
> > + ret = mnt_hold_writers(mnt);
> > + if (!ret)
> > + mnt->mnt.mnt_flags |= MNT_READONLY;
> > + mnt_unhold_writers(mnt);
> > return ret;
> > }
> >
> > @@ -3438,6 +3450,33 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
> > return ret;
> > }
>
> This refactoring seems worth a little prep patch.

Will split into separate patch.

>
> >
> > +static int build_attr_flags(unsigned int attr_flags, unsigned int *flags)
> > +{
> > + unsigned int aflags = 0;
> > +
> > + if (attr_flags & ~(MOUNT_ATTR_RDONLY |
> > + MOUNT_ATTR_NOSUID |
> > + MOUNT_ATTR_NODEV |
> > + MOUNT_ATTR_NOEXEC |
> > + MOUNT_ATTR__ATIME |
> > + MOUNT_ATTR_NODIRATIME))
> > + return -EINVAL;
> > +
> > + if (attr_flags & MOUNT_ATTR_RDONLY)
> > + aflags |= MNT_READONLY;
> > + if (attr_flags & MOUNT_ATTR_NOSUID)
> > + aflags |= MNT_NOSUID;
> > + if (attr_flags & MOUNT_ATTR_NODEV)
> > + aflags |= MNT_NODEV;
> > + if (attr_flags & MOUNT_ATTR_NOEXEC)
> > + aflags |= MNT_NOEXEC;
> > + if (attr_flags & MOUNT_ATTR_NODIRATIME)
> > + aflags |= MNT_NODIRATIME;
> > +
> > + *flags = aflags;
> > + return 0;
> > +}
>
> Same for adding this helper.

Will do.

>
> > + *kattr = (struct mount_kattr){
>
> Missing whitespace before the {.

Good spot, thank you!

>
> > + switch (attr->propagation) {
> > + case MAKE_PROPAGATION_UNCHANGED:
> > + kattr->propagation = 0;
> > + break;
> > + case MAKE_PROPAGATION_UNBINDABLE:
> > + kattr->propagation = MS_UNBINDABLE;
> > + break;
> > + case MAKE_PROPAGATION_PRIVATE:
> > + kattr->propagation = MS_PRIVATE;
> > + break;
> > + case MAKE_PROPAGATION_DEPENDENT:
> > + kattr->propagation = MS_SLAVE;
> > + break;
> > + case MAKE_PROPAGATION_SHARED:
> > + kattr->propagation = MS_SHARED;
> > + break;
> > + default:
>
> Any reason to not just reuse the MS_* flags in the new API? Yes, your
> new names are more descriptive, but having different names for the same
> thing is also rather confusing.

I'm not really married to this so I don't see a reason why not.

>
> > + if (upper_32_bits(attr->attr_set))
> > + return -EINVAL;
> > + if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
> > + return -EINVAL;
> > +
> > + if (upper_32_bits(attr->attr_clr))
> > + return -EINVAL;
> > + if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
> > + return -EINVAL;
>
> What is so magic about the upper and lower 32 bits?

Nothing apart from the fact that they arent't currently valid. I can
think about reworking these lines. Or do you already have a preferred
way of doing this in mind?

>
> > + return -EINVAL;
> > + else if ((attr->attr_clr & MOUNT_ATTR__ATIME) &&
> > + ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME))
> > + return -EINVAL;
>
> No need for the else here.

Thanks!

>
> That being said I'd reword the thing to be a little more obvious:
>
> if (attr->attr_clr & MOUNT_ATTR__ATIME) {
> if ((attr->attr_clr & MOUNT_ATTR__ATIME) != MOUNT_ATTR__ATIME)
> return -EINVAL;
>
> ... code doing the update of the atime flags here
> } else {
> if (attr->attr_set & MOUNT_ATTR__ATIME)
> return -EINVAL;
> }

Will do.

>
>
> > +/* Change propagation through mount_setattr(). */
> > +enum propagation_type {
> > + MAKE_PROPAGATION_UNCHANGED = 0, /* Don't change mount propagation (default). */
> > + MAKE_PROPAGATION_UNBINDABLE = 1, /* Make unbindable. */
> > + MAKE_PROPAGATION_PRIVATE = 2, /* Do not receive or send mount events. */
> > + MAKE_PROPAGATION_DEPENDENT = 3, /* Only receive mount events. */
> > + MAKE_PROPAGATION_SHARED = 4, /* Send and receive mount events. */
> > +};
>
> FYI, in uapis using defines instead of enums is usually the better
> choice, as that allows userspace to probe for later added defines.
>
> But if we use MS_* here that would be void anyway.

Indeed.

>
> > +/* List of all mount_attr versions. */
> > +#define MOUNT_ATTR_SIZE_VER0 24 /* sizeof first published struct */
> > +#define MOUNT_ATTR_SIZE_LATEST MOUNT_ATTR_SIZE_VER0
>
> The _LATEST things is pretty dangerous as there basically is no safe
> and correct way for userspace to use it.

Ok, I'll remove the _LATEST.

Thanks for the review (and sorry again for missing your mails)!

Christian

2020-12-02 09:51:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 04/38] fs: add mount_setattr()

On Wed, Dec 02, 2020 at 10:42:18AM +0100, Christian Brauner wrote:
> > > + if (upper_32_bits(attr->attr_set))
> > > + return -EINVAL;
> > > + if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
> > > + return -EINVAL;
> > > +
> > > + if (upper_32_bits(attr->attr_clr))
> > > + return -EINVAL;
> > > + if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
> > > + return -EINVAL;
> >
> > What is so magic about the upper and lower 32 bits?
>
> Nothing apart from the fact that they arent't currently valid. I can
> think about reworking these lines. Or do you already have a preferred
> way of doing this in mind?

Just turn the attr_flags argument to build_attr_flags into a u64 and
the first sanity check there will catch all invalid flags, no matter
where they are places. That should also generate more efficient code.

2020-12-02 09:57:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 04/38] fs: add mount_setattr()

On Wed, Dec 02, 2020 at 10:47:51AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 02, 2020 at 10:42:18AM +0100, Christian Brauner wrote:
> > > > + if (upper_32_bits(attr->attr_set))
> > > > + return -EINVAL;
> > > > + if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
> > > > + return -EINVAL;
> > > > +
> > > > + if (upper_32_bits(attr->attr_clr))
> > > > + return -EINVAL;
> > > > + if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
> > > > + return -EINVAL;
> > >
> > > What is so magic about the upper and lower 32 bits?
> >
> > Nothing apart from the fact that they arent't currently valid. I can
> > think about reworking these lines. Or do you already have a preferred
> > way of doing this in mind?
>
> Just turn the attr_flags argument to build_attr_flags into a u64 and
> the first sanity check there will catch all invalid flags, no matter
> where they are places. That should also generate more efficient code.

And while we're at it: the check for valid flags in the current
code is a little weird, given that build_attr_flags checks for
them, and but sys_fsmount also has its own slightly narrower checks.

I think it might make sense to split the validity check out of
build_attr_flags. E.g. something like:

static unsigned int attr_flags_to_mnt_flags(u64 attr_flags)
{
unsigned int mnt_flags = 0;

if (attr_flags & MOUNT_ATTR_RDONLY)
mnt_flags |= MNT_READONLY;
if (attr_flags & MOUNT_ATTR_NOSUID)
mnt_flags |= MNT_NOSUID;
if (attr_flags & MOUNT_ATTR_NODEV)
mnt_flags |= MNT_NODEV;
if (attr_flags & MOUNT_ATTR_NOEXEC)
mnt_flags |= MNT_NOEXEC;
if (attr_flags & MOUNT_ATTR_NODIRATIME)
mnt_flags |= MNT_NODIRATIME;

return mnt_flags;
}

#define MOUNT_SETATTR_VALID_FLAGS \
(MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV | \
MOUNT_ATTR_NOEXEC | MOUNT_ATTR__ATIME | MOUNT_ATTR_NODIRATIME | \
MOUNT_ATTR_IDMAP)

static int build_mount_kattr(const struct mount_attr *attr, size_t usize,
struct mount_kattr *kattr, unsigned int flags)
{
...

if ((attr->attr_set | attr->attr_clr) & ~MOUNT_SETATTR_VALID_FLAGS)
return -EINVAL;
kattr->attr_set = attr_flags_to_mnt_flags(attr->attr_set);
kattr->attr_clr = attr_flags_to_mnt_flags(attr->attr_clr);
...
}

2020-12-02 10:00:05

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v3 04/38] fs: add mount_setattr()

On Wed, Dec 02, 2020 at 10:55:47AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 02, 2020 at 10:47:51AM +0100, Christoph Hellwig wrote:
> > On Wed, Dec 02, 2020 at 10:42:18AM +0100, Christian Brauner wrote:
> > > > > + if (upper_32_bits(attr->attr_set))
> > > > > + return -EINVAL;
> > > > > + if (build_attr_flags(lower_32_bits(attr->attr_set), &kattr->attr_set))
> > > > > + return -EINVAL;
> > > > > +
> > > > > + if (upper_32_bits(attr->attr_clr))
> > > > > + return -EINVAL;
> > > > > + if (build_attr_flags(lower_32_bits(attr->attr_clr), &kattr->attr_clr))
> > > > > + return -EINVAL;
> > > >
> > > > What is so magic about the upper and lower 32 bits?
> > >
> > > Nothing apart from the fact that they arent't currently valid. I can
> > > think about reworking these lines. Or do you already have a preferred
> > > way of doing this in mind?
> >
> > Just turn the attr_flags argument to build_attr_flags into a u64 and
> > the first sanity check there will catch all invalid flags, no matter
> > where they are places. That should also generate more efficient code.
>
> And while we're at it: the check for valid flags in the current
> code is a little weird, given that build_attr_flags checks for
> them, and but sys_fsmount also has its own slightly narrower checks.
>
> I think it might make sense to split the validity check out of
> build_attr_flags. E.g. something like:

Sounds good!
If we make this a preparatory patch do you want to be recorded in the
author field?

>
> static unsigned int attr_flags_to_mnt_flags(u64 attr_flags)
> {
> unsigned int mnt_flags = 0;
>
> if (attr_flags & MOUNT_ATTR_RDONLY)
> mnt_flags |= MNT_READONLY;
> if (attr_flags & MOUNT_ATTR_NOSUID)
> mnt_flags |= MNT_NOSUID;
> if (attr_flags & MOUNT_ATTR_NODEV)
> mnt_flags |= MNT_NODEV;
> if (attr_flags & MOUNT_ATTR_NOEXEC)
> mnt_flags |= MNT_NOEXEC;
> if (attr_flags & MOUNT_ATTR_NODIRATIME)
> mnt_flags |= MNT_NODIRATIME;
>
> return mnt_flags;
> }
>
> #define MOUNT_SETATTR_VALID_FLAGS \
> (MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID | MOUNT_ATTR_NODEV | \
> MOUNT_ATTR_NOEXEC | MOUNT_ATTR__ATIME | MOUNT_ATTR_NODIRATIME | \
> MOUNT_ATTR_IDMAP)
>
> static int build_mount_kattr(const struct mount_attr *attr, size_t usize,
> struct mount_kattr *kattr, unsigned int flags)
> {
> ...
>
> if ((attr->attr_set | attr->attr_clr) & ~MOUNT_SETATTR_VALID_FLAGS)
> return -EINVAL;
> kattr->attr_set = attr_flags_to_mnt_flags(attr->attr_set);
> kattr->attr_clr = attr_flags_to_mnt_flags(attr->attr_clr);
> ...
> }

2020-12-02 10:03:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 04/38] fs: add mount_setattr()

On Wed, Dec 02, 2020 at 10:57:45AM +0100, Christian Brauner wrote:
> Sounds good!
> If we make this a preparatory patch do you want to be recorded in the
> author field?

No need to record me in any way. This was just whiteboard coding.