2020-10-29 08:36:08

by Christian Brauner

Subject: [PATCH 00/34] fs: idmapped mounts

Hey everyone,

I vanished for a little while to focus on this work, so sorry for
not being available by mail for a while.

For quite a long time we have had issues with sharing mounts between
multiple unprivileged containers with different id mappings, sharing a
rootfs between multiple containers with different id mappings, and also
sharing regular directories and filesystems between users with different
uids and gids. The latter use-cases have become even more important with
the availability and adoption of systemd-homed (cf. [1]) to implement
portable home directories.

The solutions we have tried and proposed so far include the introduction
of fsid mappings, a tiny overlay-based filesystem, and an approach that
overrides credentials in the vfs. None of these solutions covers all
of the above use-cases.

The solution proposed here has its origins in multiple discussions at
Linux Plumbers 2017, during and after the containers microconference.
To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
James, and myself. A variant of the solution proposed here has also been
discussed, again to the best of my knowledge, between Christoph, Tycho,
and myself after a Linux conference in St. Petersburg, Russia, in 2017,
after Linux Plumbers.
I've taken the time to finally implement a working version of this
solution over the last weeks to the best of my abilities. Tycho has
signed up for this slightly crazy endeavour as well and he has helped
with the conversion of the xattr codepaths.

The core idea is to make idmappings a property of struct vfsmount
instead of tying them to a process being inside of a user namespace,
which has been the case for all other proposed approaches.
This means that idmappings become a property of bind-mounts, i.e. each
bind-mount can have a separate idmapping. This has the obvious advantage
that idmapped mounts can be created inside of the initial user
namespace, i.e. on the host itself instead of requiring the caller to be
located inside of a user namespace. This enables such use-cases as e.g.
making a usb stick available in multiple locations with different
idmappings (see the vfat port that is part of this patch series).

The vfsmount struct gains a new struct user_namespace member. The
idmapping of the user namespace becomes the idmapping of the mount. A
caller that is either privileged with respect to the user namespace of
the superblock of the underlying filesystem or a caller that is
privileged with respect to the user namespace a mount has been idmapped
with can create a new bind-mount and mark it with a user namespace. The
user namespace the mount will be marked with can be specified by passing
a file descriptor referring to the user namespace as an argument to the
new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes should be observed. All mapping
operations are nops for the initial user namespace.
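
Roughly, the userspace flow for marking a mount looks like the sketch
below. The struct layout, the MOUNT_ATTR_IDMAP value, and the syscall
numbers are illustrative assumptions (x86_64 numbers as assigned in later
kernels); the authoritative definitions are the ones added by this series.

/*
 * Sketch only: mark a detached bind-mount of /home/ubuntu with the user
 * namespace of a hypothetical process (pid 1234) and attach it at /mnt.
 * Struct layout, flag value, and syscall numbers are assumptions.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_open_tree_x86_64		428
#define __NR_move_mount_x86_64		429
#define __NR_mount_setattr_guess	442		/* number used by later kernels */

#define OPEN_TREE_CLONE_VAL		1
#define MOVE_MOUNT_F_EMPTY_PATH_VAL	0x00000004
#define MOUNT_ATTR_IDMAP_GUESS		0x00100000	/* placeholder value */

struct mount_attr_sketch {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;	/* fd referring to the target user namespace */
};

int main(void)
{
	int mnt_fd, userns_fd;
	struct mount_attr_sketch attr = { .attr_set = MOUNT_ATTR_IDMAP_GUESS };

	/* detached bind-mount of the source directory */
	mnt_fd = syscall(__NR_open_tree_x86_64, AT_FDCWD, "/home/ubuntu",
			 OPEN_TREE_CLONE_VAL);

	/* user namespace whose idmapping the mount should take on */
	userns_fd = open("/proc/1234/ns/user", O_RDONLY | O_CLOEXEC);
	attr.userns_fd = (uint64_t)userns_fd;

	if (syscall(__NR_mount_setattr_guess, mnt_fd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0)
		perror("mount_setattr");

	/* attach the now idmapped mount at /mnt */
	if (syscall(__NR_move_mount_x86_64, mnt_fd, "", AT_FDCWD, "/mnt",
		    MOVE_MOUNT_F_EMPTY_PATH_VAL) < 0)
		perror("move_mount");

	return 0;
}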

When a file/inode is accessed through an idmapped mount the i_uid and
i_gid of the inode will be remapped according to the user namespace the
mount has been marked with. When a new object is created based on the
fsuid and fsgid of the caller, these will similarly be remapped according
to the user namespace of the mount the object is created from.

This means the user namespace of the mount needs to be passed down into
a few relevant inode_operations. This mostly includes inode operations
that create filesystem objects or change file attributes. Some of them
such as ->getattr() don't even need to change since they pass down a
struct path and thus the struct vfsmount is already available. Other
inode operations need to be adapted to pass down the user namespace the
vfsmount has been marked with. Al was nice enough to point out that he
will not tolerate struct vfsmount being passed to filesystems and that I
should pass down the user namespace directly, which is what I did.
The inode struct itself is never altered whenever the i_uid and i_gid
need to be mapped, i.e. i_uid and i_gid are only remapped at the time of
the check. An inode once initialized (during lookup or object creation)
is never altered when accessed through an idmapped mount.
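
In code this simply means translating at the point of comparison, using
the id translation helpers introduced later in this series; a condensed
illustration (kernel context assumed):

/*
 * The inode's i_uid is translated into the mount's user namespace for the
 * comparison only; inode->i_uid itself is never rewritten.
 */
static bool example_caller_owns_inode(struct user_namespace *mnt_userns,
				      const struct inode *inode)
{
	kuid_t i_uid = i_uid_into_mnt(mnt_userns, inode);

	return uid_eq(current_fsuid(), i_uid);
}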

To limit the amount of noise in this first iteration we have not changed
the existing inode operations but rather introduced a few new struct
inode operation methods such as ->mkdir_mapped which pass down the user
namespace of the mount they have been called from. Should this solution
be worth pursuing we have no problem adapting the existing inode
operations instead.
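
For illustration, the shape of such a method is simply the existing
operation with the mount's user namespace passed in addition; the exact
prototypes are defined in the individual patches:

/* Illustrative shape only; see the changes to include/linux/fs.h in this
 * series for the real prototypes. ->mkdir_mapped mirrors ->mkdir but also
 * receives the user namespace the vfsmount has been marked with. */
int (*mkdir)(struct inode *dir, struct dentry *dentry, umode_t mode);
int (*mkdir_mapped)(struct user_namespace *mnt_userns, struct inode *dir,
		    struct dentry *dentry, umode_t mode);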

In order to support idmapped mounts, filesystems need to be changed and
mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
iteration I tried to illustrate this by changing three different
filesystems with different levels of complexity, with some bias
towards urgent use-cases and filesystems I was at least a little more
familiar with. However, Tycho and I (and others) have no problem
converting each filesystem one-by-one. This first iteration includes fat
(msdos and vfat), ext4, and overlayfs (both with idmapped lower and
upper directories and idmapped merged directories). I'm sure I haven't
gotten everything right for all three of them in the first version of
this patch.

I have written a simple tool [2] that allows users to create idmapped
mounts so people can play with this patch series. Here are a few
illustrations:

1. Create a simple idmapped mount of another user's home directory

u1001@f2-vm:/$ sudo ./mount-idmapped --map-mount b:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--

2. Create mapping of the whole ext4 rootfs without a mapping for uid and gid 0

ubuntu@f2-vm:~$ sudo /mount-idmapped --map-mount b:1:1:65536 / /mnt/
ubuntu@f2-vm:~$ findmnt | grep mnt
└─/mnt /dev/sda2 ext4 rw,relatime
└─/mnt/mnt /dev/sda2 ext4 rw,relatime
ubuntu@f2-vm:~$ sudo mkdir /AS-ROOT-CAN-CREATE
ubuntu@f2-vm:~$ sudo mkdir /mnt/AS-ROOT-CANT-CREATE
mkdir: cannot create directory ‘/mnt/AS-ROOT-CANT-CREATE’: Value too large for defined data type
ubuntu@f2-vm:~$ mkdir /mnt/home/ubuntu/AS-USER-1000-CAN-CREATE

3. Create a vfat usb mount and expose to user 1001 and 5000

ubuntu@f2-vm:/$ sudo mount /dev/sdb /mnt
ubuntu@f2-vm:/$ findmnt | grep mnt
└─/mnt /dev/sdb vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
ubuntu@f2-vm:/$ ls -al /mnt
total 12
drwxr-xr-x 2 root root 4096 Jan 1 1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x 1 root root 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 root root 0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:1001:1 /mnt /mnt-1001/
ubuntu@f2-vm:/$ ls -al /mnt-1001/
total 12
drwxr-xr-x 2 u1001 u1001 4096 Jan 1 1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x 1 u1001 u1001 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 u1001 u1001 0 Oct 28 01:09 bbb
ubuntu@f2-vm:/$ sudo /mount-idmapped --map-mount b:0:5000:1 /mnt /mnt-5000/
ubuntu@f2-vm:/$ ls -al /mnt-5000/
total 12
drwxr-xr-x 2 5000 5000 4096 Jan 1 1970 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
-rwxr-xr-x 1 5000 5000 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 5000 5000 0 Oct 28 01:09 bbb

4. Create an idmapped rootfs mount for a container

root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/
total 68
drwxr-xr-x 17 20000 20000 4096 Sep 24 07:48 .
drwxrwx--- 3 20000 20000 4096 Oct 16 19:26 ..
lrwxrwxrwx 1 20000 20000 7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x 2 20000 20000 4096 Apr 15 2020 boot
drwxr-xr-x 3 20000 20000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 20000 20000 4096 Oct 16 19:26 etc
drwxr-xr-x 3 20000 20000 4096 Sep 24 07:45 home
lrwxrwxrwx 1 20000 20000 7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx 1 20000 20000 9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx 1 20000 20000 9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx 1 20000 20000 10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 media
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 mnt
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 opt
drwxr-xr-x 2 20000 20000 4096 Apr 15 2020 proc
drwx------ 2 20000 20000 4096 Sep 24 07:43 root
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:45 run
lrwxrwxrwx 1 20000 20000 8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x 2 20000 20000 4096 Sep 24 07:43 srv
drwxr-xr-x 2 20000 20000 4096 Apr 15 2020 sys
drwxrwxrwt 2 20000 20000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 20000 20000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 20000 20000 4096 Sep 24 07:44 var
root@f2-vm:~# /mount-idmapped --map-mount b:20000:10000:100000 /var/lib/lxc/f2/rootfs/ /mnt
root@f2-vm:~# ls -al /mnt
total 68
drwxr-xr-x 17 10000 10000 4096 Sep 24 07:48 .
drwxr-xr-x 34 root root 4096 Oct 28 22:24 ..
lrwxrwxrwx 1 10000 10000 7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x 2 10000 10000 4096 Apr 15 2020 boot
drwxr-xr-x 3 10000 10000 4096 Oct 16 19:26 dev
drwxr-xr-x 61 10000 10000 4096 Oct 16 19:26 etc
drwxr-xr-x 3 10000 10000 4096 Sep 24 07:45 home
lrwxrwxrwx 1 10000 10000 7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx 1 10000 10000 9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx 1 10000 10000 9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx 1 10000 10000 10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 media
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 mnt
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 opt
drwxr-xr-x 2 10000 10000 4096 Apr 15 2020 proc
drwx------ 2 10000 10000 4096 Sep 24 07:43 root
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:45 run
lrwxrwxrwx 1 10000 10000 8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x 2 10000 10000 4096 Sep 24 07:43 srv
drwxr-xr-x 2 10000 10000 4096 Apr 15 2020 sys
drwxrwxrwt 2 10000 10000 4096 Sep 24 07:44 tmp
drwxr-xr-x 13 10000 10000 4096 Sep 24 07:43 usr
drwxr-xr-x 12 10000 10000 4096 Sep 24 07:44 var
root@f2-vm:~# lxc-start f2 # uses /mnt as rootfs
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/uid_map
0 10000 10000
root@f2-vm:~# lxc-attach f2 -- cat /proc/1/gid_map
0 10000 10000
root@f2-vm:~# lxc-attach f2 -- ls -al /
total 52
drwxr-xr-x 17 root root 4096 Sep 24 07:48 .
drwxr-xr-x 17 root root 4096 Sep 24 07:48 ..
lrwxrwxrwx 1 root root 7 Sep 24 07:43 bin -> usr/bin
drwxr-xr-x 2 root root 4096 Apr 15 2020 boot
drwxr-xr-x 5 root root 500 Oct 28 23:39 dev
drwxr-xr-x 61 root root 4096 Oct 28 23:39 etc
drwxr-xr-x 3 root root 4096 Sep 24 07:45 home
lrwxrwxrwx 1 root root 7 Sep 24 07:43 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Sep 24 07:43 lib32 -> usr/lib32
lrwxrwxrwx 1 root root 9 Sep 24 07:43 lib64 -> usr/lib64
lrwxrwxrwx 1 root root 10 Sep 24 07:43 libx32 -> usr/libx32
drwxr-xr-x 2 root root 4096 Sep 24 07:43 media
drwxr-xr-x 2 root root 4096 Sep 24 07:43 mnt
drwxr-xr-x 2 root root 4096 Sep 24 07:43 opt
dr-xr-xr-x 232 nobody nogroup 0 Oct 28 23:39 proc
drwx------ 2 root root 4096 Oct 28 23:41 root
drwxr-xr-x 12 root root 360 Oct 28 23:39 run
lrwxrwxrwx 1 root root 8 Sep 24 07:43 sbin -> usr/sbin
drwxr-xr-x 2 root root 4096 Sep 24 07:43 srv
dr-xr-xr-x 13 nobody nogroup 0 Oct 28 23:39 sys
drwxrwxrwt 11 root root 4096 Oct 28 23:40 tmp
drwxr-xr-x 13 root root 4096 Sep 24 07:43 usr
drwxr-xr-x 12 root root 4096 Sep 24 07:44 var
root@f2-vm:~# lxc-attach f2 -- ls -al /my-file
-rw-r--r-- 1 root root 0 Oct 28 23:43 /my-file
root@f2-vm:~# ls -al /var/lib/lxc/f2/rootfs/my-file
-rw-r--r-- 1 20000 20000 0 Oct 28 23:43 /var/lib/lxc/f2/rootfs/my-file

[1]: https://systemd.io/HOME_DIRECTORY/
"If the UID assigned to a user does not match the owner of the home
directory in the file system, the home directory is automatically
and recursively chown()ed to the correct UID."
This has a huge performance impact and is also problematic since it
chowns all files independently of ownership.
[2]: https://github.com/brauner/mount-idmapped

In no particular order I'd like to say thanks to:
Al for pointing me in the right direction for avoiding inode alias issues
during lookup. David for various discussions around this. Tycho for helping
with this series and on future patches if this is in any shape or form
acceptable. Alban Crequy for pointing out more application container
use-cases. Stéphane for valuable input on various use-cases and
letting me work on this. Amir for explaining and discussing aspects of
overlayfs with me.
I'd like to especially thank Seth Forshee, who provided a lot of good
analysis and suggestions and participated in short-notice discussions
in both chat and video.

This series can be found and pulled from three locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=idmapped_mounts
https://github.com/brauner/linux/tree/idmapped_mounts
https://gitlab.com/brauner/linux/-/commits/idmapped_mounts

Thanks!
Christian

Christian Brauner (32):
namespace: take lock_mount_hash() directly when changing flags
namespace: only take read lock in do_reconfigure_mnt()
fs: add mount_setattr()
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add id translation helpers
capability: handle idmapped mounts
namei: add idmapped mount aware permission helpers
inode: add idmapped mount aware init and permission helpers
attr: handle idmapped mounts
acl: handle idmapped mounts
commoncap: handle idmapped mounts
stat: add mapped_generic_fillattr()
namei: handle idmapped mounts in may_*() helpers
namei: introduce struct renamedata
namei: prepare for idmapped mounts
namei: add lookup helpers with idmapped mounts aware permission
checking
open: handle idmapped mounts in do_truncate()
open: handle idmapped mounts
af_unix: handle idmapped mounts
utimes: handle idmapped mounts
would_dump: handle idmapped mounts
exec: handle idmapped mounts
fs: add helpers for idmap mounts
apparmor: handle idmapped mounts
audit: handle idmapped mounts
ima: handle idmapped mounts
ext4: support idmapped mounts
expfs: handle idmapped mounts
overlayfs: handle idmapped lower directories
overlayfs: handle idmapped merged mounts
fat: handle idmapped mounts

Tycho Andersen (2):
xattr: handle idmapped mounts
selftests: add idmapped mounts xattr selftest

arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/Kconfig | 6 +
fs/attr.c | 142 ++-
fs/coredump.c | 12 +-
fs/exec.c | 12 +-
fs/exportfs/expfs.c | 4 +-
fs/ext4/acl.c | 11 +-
fs/ext4/acl.h | 3 +
fs/ext4/ext4.h | 14 +-
fs/ext4/file.c | 4 +
fs/ext4/ialloc.c | 7 +-
fs/ext4/inode.c | 27 +-
fs/ext4/ioctl.c | 18 +-
fs/ext4/namei.c | 145 ++-
fs/ext4/super.c | 4 +
fs/ext4/symlink.c | 9 +
fs/ext4/xattr_hurd.c | 22 +-
fs/ext4/xattr_security.c | 18 +-
fs/ext4/xattr_trusted.c | 18 +-
fs/fat/fat.h | 2 +
fs/fat/file.c | 27 +-
fs/fat/namei_msdos.c | 7 +
fs/fat/namei_vfat.c | 7 +
fs/inode.c | 66 +-
fs/internal.h | 9 +
fs/namei.c | 597 ++++++++----
fs/namespace.c | 446 ++++++++-
fs/open.c | 52 +-
fs/overlayfs/copy_up.c | 104 +-
fs/overlayfs/dir.c | 219 +++--
fs/overlayfs/export.c | 3 +-
fs/overlayfs/file.c | 23 +-
fs/overlayfs/inode.c | 121 ++-
fs/overlayfs/namei.c | 64 +-
fs/overlayfs/overlayfs.h | 158 +++-
fs/overlayfs/ovl_entry.h | 1 +
fs/overlayfs/readdir.c | 34 +-
fs/overlayfs/super.c | 109 ++-
fs/overlayfs/util.c | 38 +-
fs/posix_acl.c | 130 ++-
fs/stat.c | 18 +-
fs/utimes.c | 4 +-
fs/xattr.c | 264 ++++--
include/linux/audit.h | 10 +-
include/linux/capability.h | 12 +-
include/linux/fs.h | 254 ++++-
include/linux/ima.h | 15 +-
include/linux/lsm_hook_defs.h | 10 +-
include/linux/lsm_hooks.h | 1 +
include/linux/mount.h | 20 +-
include/linux/namei.h | 6 +
include/linux/posix_acl.h | 14 +-
include/linux/posix_acl_xattr.h | 12 +-
include/linux/security.h | 36 +-
include/linux/syscalls.h | 3 +
include/linux/xattr.h | 29 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/mount.h | 26 +
ipc/mqueue.c | 8 +-
kernel/auditsc.c | 29 +-
kernel/capability.c | 22 +-
net/unix/af_unix.c | 2 +-
security/apparmor/domain.c | 9 +-
security/apparmor/file.c | 5 +-
security/apparmor/lsm.c | 12 +-
security/commoncap.c | 50 +-
security/integrity/ima/ima.h | 19 +-
security/integrity/ima/ima_api.c | 10 +-
security/integrity/ima/ima_appraise.c | 14 +-
security/integrity/ima/ima_asymmetric_keys.c | 2 +-
security/integrity/ima/ima_main.c | 28 +-
security/integrity/ima/ima_policy.c | 17 +-
security/integrity/ima/ima_queue_keys.c | 2 +-
security/security.c | 18 +-
security/selinux/hooks.c | 13 +-
security/smack/smack_lsm.c | 11 +-
tools/include/uapi/asm-generic/unistd.h | 4 +-
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/idmap_mounts/.gitignore | 1 +
tools/testing/selftests/idmap_mounts/Makefile | 8 +
tools/testing/selftests/idmap_mounts/config | 1 +
tools/testing/selftests/idmap_mounts/xattr.c | 389 ++++++++
.../selftests/mount_setattr/.gitignore | 1 +
.../testing/selftests/mount_setattr/Makefile | 7 +
tools/testing/selftests/mount_setattr/config | 1 +
.../mount_setattr/mount_setattr_test.c | 888 ++++++++++++++++++
102 files changed, 4109 insertions(+), 912 deletions(-)
create mode 100644 tools/testing/selftests/idmap_mounts/.gitignore
create mode 100644 tools/testing/selftests/idmap_mounts/Makefile
create mode 100644 tools/testing/selftests/idmap_mounts/config
create mode 100644 tools/testing/selftests/idmap_mounts/xattr.c
create mode 100644 tools/testing/selftests/mount_setattr/.gitignore
create mode 100644 tools/testing/selftests/mount_setattr/Makefile
create mode 100644 tools/testing/selftests/mount_setattr/config
create mode 100644 tools/testing/selftests/mount_setattr/mount_setattr_test.c


base-commit: 3650b228f83adda7e5ee532e2b90429c03f7b9ec
--
2.29.0


2020-10-29 08:36:17

by Christian Brauner

Subject: [PATCH 08/34] namei: add idmapped mount aware permission helpers

The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts we
add the two helpers mapped_inode_permission() and
mapped_generic_permission() which take a user namespace argument. On
idmapped mounts the two new helpers will make sure to map the inode
according to the mount's user namespace and then perform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed mapped_inode_permission() and
mapped_generic_permission() are identical to inode_permission() and
generic_permission() so there will be no performance impact on
non-idmapped mounts. This also means that the inode_permission() and
generic_permission() helpers can be implemented on top of
mapped_inode_permission() and mapped_generic_permission() respectively
by just passing in the initial user namespace so no code is
unnecessarily duplicated.
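
A caller that has a struct path at hand then does something like the
following; the mnt_user_ns() accessor name below is purely illustrative
and stands in for whatever helper reads the user namespace a vfsmount has
been marked with:

/* Illustrative call site only; mnt_user_ns() is a stand-in name. */
static int example_may_write(const struct path *path, struct inode *inode)
{
	struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);

	return mapped_inode_permission(mnt_userns, inode, MAY_WRITE);
}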

Signed-off-by: Christian Brauner <[email protected]>
---
fs/namei.c | 71 ++++++++++++++++++++++++++++-----------
fs/posix_acl.c | 16 ++++++---
include/linux/fs.h | 2 ++
include/linux/posix_acl.h | 4 ++-
4 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d4a6dd772303..2635f6a57de5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -259,7 +259,7 @@ void putname(struct filename *name)
__putname(name);
}

-static int check_acl(struct inode *inode, int mask)
+static int check_acl(struct user_namespace *user_ns, struct inode *inode, int mask)
{
#ifdef CONFIG_FS_POSIX_ACL
struct posix_acl *acl;
@@ -271,14 +271,14 @@ static int check_acl(struct inode *inode, int mask)
/* no ->get_acl() calls in RCU mode... */
if (is_uncached_acl(acl))
return -ECHILD;
- return posix_acl_permission(inode, acl, mask);
+ return posix_acl_permission(user_ns, inode, acl, mask);
}

acl = get_acl(inode, ACL_TYPE_ACCESS);
if (IS_ERR(acl))
return PTR_ERR(acl);
if (acl) {
- int error = posix_acl_permission(inode, acl, mask);
+ int error = posix_acl_permission(user_ns, inode, acl, mask);
posix_acl_release(acl);
return error;
}
@@ -293,12 +293,14 @@ static int check_acl(struct inode *inode, int mask)
* Note that the POSIX ACL check cares about the MAY_NOT_BLOCK bit,
* for RCU walking.
*/
-static int acl_permission_check(struct inode *inode, int mask)
+static int acl_permission_check(struct user_namespace *user_ns, struct inode *inode, int mask)
{
unsigned int mode = inode->i_mode;
+ kuid_t i_uid;

/* Are we the owner? If so, ACL's don't matter */
- if (likely(uid_eq(current_fsuid(), inode->i_uid))) {
+ i_uid = i_uid_into_mnt(user_ns, inode);
+ if (likely(uid_eq(current_fsuid(), i_uid))) {
mask &= 7;
mode >>= 6;
return (mask & ~mode) ? -EACCES : 0;
@@ -306,7 +308,7 @@ static int acl_permission_check(struct inode *inode, int mask)

/* Do we have ACL's? */
if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
- int error = check_acl(inode, mask);
+ int error = check_acl(user_ns, inode, mask);
if (error != -EAGAIN)
return error;
}
@@ -320,7 +322,8 @@ static int acl_permission_check(struct inode *inode, int mask)
* about? Need to check group ownership if so.
*/
if (mask & (mode ^ (mode >> 3))) {
- if (in_group_p(inode->i_gid))
+ kgid_t kgid = i_gid_into_mnt(user_ns, inode);
+ if (in_group_p(kgid))
mode >>= 3;
}

@@ -329,7 +332,7 @@ static int acl_permission_check(struct inode *inode, int mask)
}

/**
- * generic_permission - check for access rights on a Posix-like filesystem
+ * mapped_generic_permission - check for access rights on a Posix-like filesystem
* @inode: inode to check access rights for
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC,
* %MAY_NOT_BLOCK ...)
@@ -343,24 +346,25 @@ static int acl_permission_check(struct inode *inode, int mask)
* request cannot be satisfied (eg. requires blocking or too much complexity).
* It would then be called again in ref-walk mode.
*/
-int generic_permission(struct inode *inode, int mask)
+int mapped_generic_permission(struct user_namespace *user_ns, struct inode *inode,
+ int mask)
{
int ret;

/*
* Do the basic permission checks.
*/
- ret = acl_permission_check(inode, mask);
+ ret = acl_permission_check(user_ns, inode, mask);
if (ret != -EACCES)
return ret;

if (S_ISDIR(inode->i_mode)) {
/* DACs are overridable for directories */
if (!(mask & MAY_WRITE))
- if (capable_wrt_inode_uidgid(inode,
- CAP_DAC_READ_SEARCH))
+ if (capable_wrt_mapped_inode_uidgid(user_ns, inode,
+ CAP_DAC_READ_SEARCH))
return 0;
- if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
+ if (capable_wrt_mapped_inode_uidgid(user_ns, inode, CAP_DAC_OVERRIDE))
return 0;
return -EACCES;
}
@@ -370,7 +374,8 @@ int generic_permission(struct inode *inode, int mask)
*/
mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
if (mask == MAY_READ)
- if (capable_wrt_inode_uidgid(inode, CAP_DAC_READ_SEARCH))
+ if (capable_wrt_mapped_inode_uidgid(user_ns, inode,
+ CAP_DAC_READ_SEARCH))
return 0;
/*
* Read/write DACs are always overridable.
@@ -378,11 +383,18 @@ int generic_permission(struct inode *inode, int mask)
* at least one exec bit set.
*/
if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
- if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
+ if (capable_wrt_mapped_inode_uidgid(user_ns, inode,
+ CAP_DAC_OVERRIDE))
return 0;

return -EACCES;
}
+EXPORT_SYMBOL(mapped_generic_permission);
+
+int generic_permission(struct inode *inode, int mask)
+{
+ return mapped_generic_permission(&init_user_ns, inode, mask);
+}
EXPORT_SYMBOL(generic_permission);

/*
@@ -391,7 +403,7 @@ EXPORT_SYMBOL(generic_permission);
* flag in inode->i_opflags, that says "this has not special
* permission function, use the fast case".
*/
-static inline int do_inode_permission(struct inode *inode, int mask)
+static inline int do_inode_permission(struct user_namespace *user_ns, struct inode *inode, int mask)
{
if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
if (likely(inode->i_op->permission))
@@ -402,7 +414,7 @@ static inline int do_inode_permission(struct inode *inode, int mask)
inode->i_opflags |= IOP_FASTPERM;
spin_unlock(&inode->i_lock);
}
- return generic_permission(inode, mask);
+ return mapped_generic_permission(user_ns, inode, mask);
}

/**
@@ -426,7 +438,9 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
}

/**
- * inode_permission - Check for access rights to a given inode
+ * mapped_inode_permission - Check for access rights to a given inode as seen from
+ * a given user namespace
+ * @userns: The user namespace the inode is seen from
* @inode: Inode to check permission on
* @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
*
@@ -436,7 +450,7 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
*
* When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
*/
-int inode_permission(struct inode *inode, int mask)
+int mapped_inode_permission(struct user_namespace *user_ns, struct inode *inode, int mask)
{
int retval;

@@ -460,7 +474,7 @@ int inode_permission(struct inode *inode, int mask)
return -EACCES;
}

- retval = do_inode_permission(inode, mask);
+ retval = do_inode_permission(user_ns, inode, mask);
if (retval)
return retval;

@@ -470,6 +484,23 @@ int inode_permission(struct inode *inode, int mask)

return security_inode_permission(inode, mask);
}
+EXPORT_SYMBOL(mapped_inode_permission);
+
+/**
+ * inode_permission - Check for access rights to a given inode
+ * @inode: Inode to check permission on
+ * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Check for read/write/execute permissions on an inode. We use fs[ug]id for
+ * this, letting us set arbitrary permissions for filesystem access without
+ * changing the "normal" UIDs which are used for other things.
+ *
+ * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
+ */
+int inode_permission(struct inode *inode, int mask)
+{
+ return mapped_inode_permission(&init_user_ns, inode, mask);
+}
EXPORT_SYMBOL(inode_permission);

/**
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 95882b3f5f62..f15b6ad35ec3 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -345,10 +345,12 @@ EXPORT_SYMBOL(posix_acl_from_mode);
* by the acl. Returns -E... otherwise.
*/
int
-posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
+posix_acl_permission(struct user_namespace *user_ns, struct inode *inode, const struct posix_acl *acl, int want)
{
const struct posix_acl_entry *pa, *pe, *mask_obj;
int found = 0;
+ kuid_t uid;
+ kgid_t gid;

want &= MAY_READ | MAY_WRITE | MAY_EXEC;

@@ -356,22 +358,26 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
switch(pa->e_tag) {
case ACL_USER_OBJ:
/* (May have been checked already) */
- if (uid_eq(inode->i_uid, current_fsuid()))
+ uid = i_uid_into_mnt(user_ns, inode);
+ if (uid_eq(uid, current_fsuid()))
goto check_perm;
break;
case ACL_USER:
- if (uid_eq(pa->e_uid, current_fsuid()))
+ uid = kuid_into_mnt(user_ns, pa->e_uid);
+ if (uid_eq(uid, current_fsuid()))
goto mask;
break;
case ACL_GROUP_OBJ:
- if (in_group_p(inode->i_gid)) {
+ gid = i_gid_into_mnt(user_ns, inode);
+ if (in_group_p(gid)) {
found = 1;
if ((pa->e_perm & want) == want)
goto mask;
}
break;
case ACL_GROUP:
- if (in_group_p(pa->e_gid)) {
+ gid = kgid_into_mnt(user_ns, pa->e_gid);
+ if (in_group_p(gid)) {
found = 1;
if ((pa->e_perm & want) == want)
goto mask;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8a891b80d0b4..750ca4b3d89f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2820,7 +2820,9 @@ static inline int bmap(struct inode *inode, sector_t *block)

extern int notify_change(struct dentry *, struct iattr *, struct inode **);
extern int inode_permission(struct inode *, int);
+extern int mapped_inode_permission(struct user_namespace *, struct inode *, int);
extern int generic_permission(struct inode *, int);
+extern int mapped_generic_permission(struct user_namespace *, struct inode *, int);
extern int __check_sticky(struct inode *dir, struct inode *inode);

static inline bool execute_ok(struct inode *inode)
diff --git a/include/linux/posix_acl.h b/include/linux/posix_acl.h
index 90797f1b421d..8276baefed13 100644
--- a/include/linux/posix_acl.h
+++ b/include/linux/posix_acl.h
@@ -15,6 +15,8 @@
#include <linux/refcount.h>
#include <uapi/linux/posix_acl.h>

+struct user_namespace;
+
struct posix_acl_entry {
short e_tag;
unsigned short e_perm;
@@ -62,7 +64,7 @@ posix_acl_release(struct posix_acl *acl)
extern void posix_acl_init(struct posix_acl *, int);
extern struct posix_acl *posix_acl_alloc(int, gfp_t);
extern int posix_acl_valid(struct user_namespace *, const struct posix_acl *);
-extern int posix_acl_permission(struct inode *, const struct posix_acl *, int);
+extern int posix_acl_permission(struct user_namespace *, struct inode *, const struct posix_acl *, int);
extern struct posix_acl *posix_acl_from_mode(umode_t, gfp_t);
extern int posix_acl_equiv_mode(const struct posix_acl *, umode_t *);
extern int __posix_acl_create(struct posix_acl **, gfp_t, umode_t *);
--
2.29.0

2020-10-29 08:36:19

by Christian Brauner

Subject: [PATCH 09/34] inode: add idmapped mount aware init and permission helpers

The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Add a new
mapped_inode_owner_or_capable() helper to handle idmapped mounts. If
the inode is accessed through an idmapped mount we first need to map
it according to the mount's user namespace. Afterwards the checks are
identical to non-idmapped mounts. If the initial user namespace is
passed all operations are a nop so non-idmapped mounts will not see a
change in behavior and will also not see any performance impact. It also
means that the inode_owner_or_capable() helper can be implemented on top
of mapped_inode_owner_or_capable() by passing in the initial user
namespace.

Similarly, we add a new mapped_inode_init_owner() helper which
initializes a new inode on idmapped mounts by mapping the fsuid and
fsgid of the caller from the mount's user namespace. If the initial user
namespace is passed all operations are a nop so non-idmapped mounts will
not see a change in behavior and will also not see any performance
impact. It also means that the inode_init_owner() helper can be
implemented on top of mapped_inode_init_owner() by passing in the
initial user namespace.
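
A filesystem's create path on an idmapped mount then initializes the new
inode roughly as follows (hypothetical filesystem snippet, for
illustration only):

/* Hypothetical filesystem snippet, for illustration only. */
static int example_fs_mkdir_mapped(struct user_namespace *mnt_userns,
				   struct inode *dir, struct dentry *dentry,
				   umode_t mode)
{
	struct inode *inode = new_inode(dir->i_sb);

	if (!inode)
		return -ENOMEM;

	/* maps the caller's fsuid/fsgid from the mount's user namespace */
	mapped_inode_init_owner(inode, mnt_userns, dir, S_IFDIR | mode);
	d_instantiate(dentry, inode);
	return 0;
}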

Signed-off-by: Christian Brauner <[email protected]>
---
fs/inode.c | 53 ++++++++++++++++++++++++++++++++++++----------
include/linux/fs.h | 4 ++++
2 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9d78c37b00b8..22de3cb3b1f4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2130,15 +2130,17 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
EXPORT_SYMBOL(init_special_inode);

/**
- * inode_init_owner - Init uid,gid,mode for new inode according to posix standards
+ * mapped_inode_init_owner - Init uid,gid,mode for new inode according to posix
+ * standards on idmapped mounts
* @inode: New inode
+ * @user_ns: User namespace the inode is accessed from
* @dir: Directory inode
* @mode: mode of the new inode
*/
-void inode_init_owner(struct inode *inode, const struct inode *dir,
- umode_t mode)
+void mapped_inode_init_owner(struct inode *inode, struct user_namespace *user_ns,
+ const struct inode *dir, umode_t mode)
{
- inode->i_uid = current_fsuid();
+ inode->i_uid = fsuid_into_mnt(user_ns);
if (dir && dir->i_mode & S_ISGID) {
inode->i_gid = dir->i_gid;

@@ -2146,34 +2148,63 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
if (S_ISDIR(mode))
mode |= S_ISGID;
else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
- !in_group_p(inode->i_gid) &&
- !capable_wrt_inode_uidgid(dir, CAP_FSETID))
+ !in_group_p(i_gid_into_mnt(user_ns, inode)) &&
+ !capable_wrt_mapped_inode_uidgid(user_ns, dir, CAP_FSETID))
mode &= ~S_ISGID;
} else
- inode->i_gid = current_fsgid();
+ inode->i_gid = fsgid_into_mnt(user_ns);
inode->i_mode = mode;
}
+EXPORT_SYMBOL(mapped_inode_init_owner);
+
+/**
+ * inode_init_owner - Init uid,gid,mode for new inode according to posix standards
+ * @inode: New inode
+ * @dir: Directory inode
+ * @mode: mode of the new inode
+ */
+void inode_init_owner(struct inode *inode, const struct inode *dir,
+ umode_t mode)
+{
+ return mapped_inode_init_owner(inode, &init_user_ns, dir, mode);
+}
EXPORT_SYMBOL(inode_init_owner);

/**
- * inode_owner_or_capable - check current task permissions to inode
+ * mapped_inode_owner_or_capable - check current task permissions to inode on idmapped mounts
+ * @user_ns: User namespace the inode is accessed from
* @inode: inode being checked
*
* Return true if current either has CAP_FOWNER in a namespace with the
* inode owner uid mapped, or owns the file.
*/
-bool inode_owner_or_capable(const struct inode *inode)
+bool mapped_inode_owner_or_capable(struct user_namespace *user_ns, const struct inode *inode)
{
+ kuid_t i_uid;
struct user_namespace *ns;

- if (uid_eq(current_fsuid(), inode->i_uid))
+ i_uid = i_uid_into_mnt(user_ns, inode);
+ if (uid_eq(current_fsuid(), i_uid))
return true;

ns = current_user_ns();
- if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+ if (kuid_has_mapping(ns, i_uid) && ns_capable(ns, CAP_FOWNER))
return true;
return false;
}
+EXPORT_SYMBOL(mapped_inode_owner_or_capable);
+
+/**
+ * inode_owner_or_capable - check current task permissions to inode
+ * @inode: inode being checked
+ *
+ * Return true if current either has CAP_FOWNER in a namespace with the
+ * inode owner uid mapped, or owns the file.
+ */
+bool inode_owner_or_capable(const struct inode *inode)
+{
+ return mapped_inode_owner_or_capable(&init_user_ns, inode);
+}
EXPORT_SYMBOL(inode_owner_or_capable);

/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 750ca4b3d89f..f9e2d292b7b6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1777,6 +1777,8 @@ static inline int sb_start_intwrite_trylock(struct super_block *sb)


extern bool inode_owner_or_capable(const struct inode *inode);
+extern bool mapped_inode_owner_or_capable(struct user_namespace *ns,
+ const struct inode *inode);

/*
* VFS helper functions..
@@ -1820,6 +1822,8 @@ extern long compat_ptr_ioctl(struct file *file, unsigned int cmd,
*/
extern void inode_init_owner(struct inode *inode, const struct inode *dir,
umode_t mode);
+extern void mapped_inode_init_owner(struct inode *inode, struct user_namespace *user_ns,
+ const struct inode *dir, umode_t mode);
extern bool may_open_dev(const struct path *path);

/*
--
2.29.0

2020-10-29 08:36:22

by Christian Brauner

Subject: [PATCH 06/34] fs: add id translation helpers

Add simple helpers to make it easy to map kuids and kgids into and from
idmapped mounts. We provide simple wrappers that filesystems can use to
e.g. initialize inodes, similar to i_{uid,gid}_read() and
i_{uid,gid}_write(). Accessing an inode through an idmapped mount will
require the inode to be mapped according to the mount's user namespace.
If the fsids are used to compare against inodes or to initialize inodes
they are required to be shifted from the mount's user namespace. Passing
the initial user namespace to these helpers makes them a nop and so any
non-idmapped paths will not be impacted.
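
A rough usage sketch (hypothetical caller, not part of this patch): ids
used to initialize a new inode are shifted from the mount's user
namespace via the fs{u,g}id helpers:

/* Hypothetical caller, for illustration only. */
static void example_fill_new_inode(struct user_namespace *mnt_userns,
				   struct inode *inode)
{
	inode->i_uid = fsuid_into_mnt(mnt_userns);
	inode->i_gid = fsgid_into_mnt(mnt_userns);
}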

Signed-off-by: Christian Brauner <[email protected]>
---
include/linux/fs.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8314cd351673..8a891b80d0b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -39,6 +39,7 @@
#include <linux/fs_types.h>
#include <linux/build_bug.h>
#include <linux/stddef.h>
+#include <linux/cred.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1574,6 +1575,80 @@ static inline void i_gid_write(struct inode *inode, gid_t gid)
inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid);
}

+static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return make_kuid(to, __kuid_val(kuid));
+#else
+ return kuid;
+#endif
+}
+
+static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return make_kgid(to, __kgid_val(kgid));
+#else
+ return kgid;
+#endif
+}
+
+static inline kuid_t i_uid_into_mnt(struct user_namespace *to,
+ const struct inode *inode)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return kuid_into_mnt(to, inode->i_uid);
+#else
+ return inode->i_uid;
+#endif
+}
+
+static inline kgid_t i_gid_into_mnt(struct user_namespace *to,
+ const struct inode *inode)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return kgid_into_mnt(to, inode->i_gid);
+#else
+ return inode->i_gid;
+#endif
+}
+
+static inline kuid_t kuid_from_mnt(struct user_namespace *to, kuid_t kuid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return KUIDT_INIT(from_kuid(to, kuid));
+#else
+ return kuid;
+#endif
+}
+
+static inline kgid_t kgid_from_mnt(struct user_namespace *to, kgid_t kgid)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return KGIDT_INIT(from_kgid(to, kgid));
+#else
+ return kgid;
+#endif
+}
+
+static inline kuid_t fsuid_into_mnt(struct user_namespace *to)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return kuid_from_mnt(to, current_fsuid());
+#else
+ return current_fsuid();
+#endif
+}
+
+static inline kgid_t fsgid_into_mnt(struct user_namespace *to)
+{
+#ifdef CONFIG_IDMAP_MOUNTS
+ return kgid_from_mnt(to, current_fsgid());
+#else
+ return current_fsgid();
+#endif
+}
+
extern struct timespec64 current_time(struct inode *inode);

/*
--
2.29.0

2020-10-29 08:57:32

by Sargun Dhillon

Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
>
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
>
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
>
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
>
> The solution proposed here has its origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this slightly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
>
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
>
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace. The
> user namespace the mount will be marked with can be specified by passing
> a file descriptor referring to the user namespace as an argument to the
> new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> By default vfsmounts are marked with the initial user namespace and no
> behavioral or performance changes should be observed. All mapping
> operations are nops for the initial user namespace.
>
> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they are created from.
>
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes. Some of them
> such as ->getattr() don't even need to change since they pass down a
> struct path and thus the struct vfsmount is already available. Other
> inode operations need to be adapted to pass down the user namespace the
> vfsmount has been marked with. Al was nice enough to point out that he
> will not tolerate struct vfsmount being passed to filesystems and that I
> should pass down the user namespace directly; which is what I did.
> The inode struct itself is never altered whenever the i_uid and i_gid
> need to be mapped, i.e. i_uid and i_gid are only remapped at the time of
> the check. An inode once initialized (during lookup or object creation)
> is never altered when accessed through an idmapped mount.
>
> To limit the amount of noise in this first iteration we have not changed
> the existing inode operations but rather introduced a few new struct
> inode operation methods such as ->mkdir_mapped which pass down the user
> namespace of the mount they have been called from. Should this solution
> be worth pursuing we have no problem adapting the existing inode
> operations instead.
>
> In order to support idmapped mounts, filesystems need to be changed and
> mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
> iteration I tried to illustrate this by changing three different
> filesystems with different levels of complexity. Of course with some bias
> towards urgent use-cases and filesystems I was at least a little more
> familiar with. However, Tycho and I (and others) have no problem
> converting each filesystem one-by-one. This first iteration includes fat
> (msdos and vfat), ext4, and overlayfs (both with idmapped lower and
> upper directories and idmapped merged directories). I'm sure I haven't
> gotten everything right for all three of them in the first version of
> this patch.
>

Thanks for this patchset. It's been a long-time coming.

I'm curious, for most cases, how much the new fs mount APIs help, and
whether focusing on those could solve the problem for everything other than
bind mounts. Specifically, the idea of doing fsopen() (creation of the
fs_context) in the user namespace in question, and relying on a user with
CAP_SYS_ADMIN to call fsmount() [1]. I think this is especially valuable for places like
overlayfs that use the entire cred object, as opposed to just the uid / gid. I
imagine that soon, most filesystems will support the new mount APIs, and not set
the global flag if they don't need to.

How popular is the "vfsmount (bind mounts) needs different uid mappings" use
case?

The other thing I worry about is the "What UID are you really?" game that's been
a thing recently. For example, you can have a different user namespace UID
mapping for your network namespace that netfilter checks[2], and a different one
for your mount namespace, and a different one that the process is actually in.
This proliferation of different mappings makes auditing, and doing things like
writing perf tooling, more difficult (since I think bpf_get_current_uid_gid
still uses the initial user namespace [3]).

[1]: https://lore.kernel.org/linux-nfs/[email protected]/T/#u
[2]: https://elixir.bootlin.com/linux/v5.9.1/source/net/netfilter/xt_owner.c#L37
[3]: https://elixir.bootlin.com/linux/v5.9.1/source/kernel/bpf/helpers.c#L196

2020-10-29 09:07:05

by Dave Chinner

Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> Hey everyone,
>
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
>
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
>
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
>
> The solution proposed here has its origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this slightly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
>
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
>
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace. The
> user namespace the mount will be marked with can be specified by passing
> a file descriptor referring to the user namespace as an argument to the
> new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> By default vfsmounts are marked with the initial user namespace and no
> behavioral or performance changes should be observed. All mapping
> operations are nops for the initial user namespace.
>
> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they are created from.
>
> This means the user namespace of the mount needs to be passed down into
> a few relevant inode_operations. This mostly includes inode operations
> that create filesystem objects or change file attributes.

That's really quite ... messy.

Maybe I'm missing something, but if you have the user_ns to be used
for the VFS operation we are about to execute then why can't we use
the same model as current_fsuid/current_fsgid() for passing the
filesystem credentials down to the filesystem operations? i.e.
attach it to the current->cred->fs_userns, and then the filesystem
code that actually needs to know the current userns can call
current_fs_user_ns() instead of current_user_ns(). i.e.

#define current_fs_user_ns() \
	(current->cred->fs_userns ? current->cred->fs_userns \
				  : current->cred->userns)

At this point, the filesystem will now always have the correct
userns it is supposed to use for mapping the uid/gid, right?

Also, if we are passing work off to worker threads, duplicating
the current creds will capture this information and won't leave
random landmines where stuff doesn't work as it should because the
worker thread is unaware of the userns that it is supposed to be
doing filesystem operations under...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-10-29 15:49:40

by Eric W. Biederman

Subject: Re: [PATCH 00/34] fs: idmapped mounts

Christian Brauner <[email protected]> writes:

> Hey everyone,
>
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
>
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.

Can you walk us through the motivating use case?

As of this year's LPC I had the distinct impression that the primary use
case for such a feature was due to the RLIMIT_NPROC problem where two
containers with the same users still wanted different uid mappings to
the disk because the users were conflicting with each other because of
the per user rlimits.

Fixing rlimits is straightforward to implement, and easier to manage
for implementations and administrators.



Reading up on systemd-homed it appears to be a way to have encrypted
home directories. Those home directories can either be encrypted at the
fs or at the block level. Those home directories appear to have the
goal of being luggable between systems. If the systems in question
don't have common administration of uids and gids, then after lugging
your encrypted home directory to another system chowning the files is
required.

Is that the use case you are looking at: removing the need for
systemd-homed to chown after lugging encrypted home directories
from one system to another? Why would it be desirable to avoid the
chown?


If the goal is to solve fragmented administration of uid assignment I
suggest that it might be better to solve the administration problem so
that all of the uids of interest get assigned the same way on all of the
systems of interest. To the extent it is possible to solve the uid
assignment problem that would seem to have fewer long term
administrative problems.

Eric

2020-10-29 15:52:58

by Aleksa Sarai

Subject: Re: [PATCH 00/34] fs: idmapped mounts

On 2020-10-29, Eric W. Biederman <[email protected]> wrote:
> Christian Brauner <[email protected]> writes:
>
> > Hey everyone,
> >
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> >
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
>
> Can you walk us through the motivating use case?
>
> As of this year's LPC I had the distinct impression that the primary use
> case for such a feature was due to the RLIMIT_NPROC problem where two
> containers with the same users still wanted different uid mappings to
> the disk because the users were conflicting with each other because of
> the per user rlimits.
>
> Fixing rlimits is straight forward to implement, and easier to manage
> for implementations and administrators.

This is separate from the question of "isolated user namespaces" and
managing different mappings between containers. This patchset is solving
the same problem that shiftfs solved -- sharing a single directory tree
between containers that have different ID mappings. Neither rlimits nor
any of the other proposals we discussed at LPC will help with this problem.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-10-29 16:16:31

by Lennart Poettering

Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, 29.10.20 10:47, Eric W. Biederman ([email protected]) wrote:

> Is that the use case you are looking at removing the need for
> systemd-homed to avoid chowning after lugging encrypted home directories
> from one system to another? Why would it be desirable to avoid the
> chown?

Yes, I am very interested in seeing Christian's work succeed, for the
use case in systemd-homed. In systemd-homed each user gets their own
private file system, and this fs shall be owned by the user's local
UID, regardless of which system it is used on. The UID should be an
artifact of the local, individual system in this model, and thus
the UID on of the same user/home on system A might be picked as 1010
and on another as 1543, and on a third as 1323, and it shouldn't
matter. This way, home directories become migratable without having to
universially sync UID assignments: it doesn't matter anymore what the
local UID is.

Right now we do a recursive chown() at login time to ensure the home
dir is properly owned. This has two disadvantages:

1. It's slow. In particular on large home dirs, it takes a while to go
through the whole user's homedir tree and chown/adjust ACLs for
everything.

2. Because it is so slow we take a shortcut right now: if the
top-level home dir inode itself is owned by the correct user, we
skip the recursive chowning. This means that in the typical case, where
a user uses the same system most of the time and thus the UID is
stable, we can avoid the slowness. But this comes with a drawback: if
the user for some reason ends up with files in their homedir owned
by an unrelated user, then we'll never notice or readjust.
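
For a sense of where the time goes: the recursive adjustment boils down
to at least one syscall per inode (plus the ACL/xattr work, not shown
here). A minimal sketch, not systemd-homed's actual code, with the
target uid/gid taken from the command line:

/* sketch only: walk a tree and lchown() every inode */
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

static uid_t new_uid;
static gid_t new_gid;

static int fixup(const char *path, const struct stat *sb, int type,
		 struct FTW *ftw)
{
	(void)sb; (void)type; (void)ftw;
	/* one lchown() per inode; ACLs and fscaps would need more work */
	if (lchown(path, new_uid, new_gid) < 0)
		perror(path);
	return 0; /* keep walking */
}

int main(int argc, char *argv[])
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <dir> <uid> <gid>\n", argv[0]);
		return 1;
	}
	new_uid = (uid_t)atoi(argv[2]);
	new_gid = (gid_t)atoi(argv[3]);
	/* FTW_PHYS: don't follow symlinks while walking the tree */
	return nftw(argv[1], fixup, 64, FTW_PHYS) ? 1 : 0;
}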

> If the goal is to solve fragmented administration of uid assignment I
> suggest that it might be better to solve the administration problem so
> that all of the uids of interest get assigned the same way on all of the
> systems of interest.

Well, the goal is to make things simple and be able to use the home
dir everywhere without any prior preparation, without central UID
assignment authority.

The goal is to have a scheme that requires no administration, by
making the UID management problem go away. Hence, if you suggest
solving this by having a central administrative authority: this is
exactly what the model wants to get away from.

Or to say this differently: just because I personally use three
different computers, I certainly don't want to set up LDAP or sync
UIDs manually.

Lennart

--
Lennart Poettering, Berlin

2020-10-29 16:22:28

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 01:27:33PM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
> > Hey everyone,
> >
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> >
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> >
> > The solutions we have tried and proposed so far include the introduction
> > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > call override creds in the vfs. None of these solutions have covered all
> > of the above use-cases.
> >
> > The solution proposed here has it's origins in multiple discussions
> > during Linux Plumbers 2017 during and after the end of the containers
> > microconference.
> > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > James, and myself. A variant of the solution proposed here has also been
> > discussed, again to the best of my knowledge, after a Linux conference
> > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > after Linux Plumbers.
> > I've taken the time to finally implement a working version of this
> > solution over the last weeks to the best of my abilities. Tycho has
> > signed up for this sligthly crazy endeavour as well and he has helped
> > with the conversion of the xattr codepaths.
> >
> > The core idea is to make idmappings a property of struct vfsmount
> > instead of tying it to a process being inside of a user namespace which
> > has been the case for all other proposed approaches.
> > It means that idmappings become a property of bind-mounts, i.e. each
> > bind-mount can have a separate idmapping. This has the obvious advantage
> > that idmapped mounts can be created inside of the initial user
> > namespace, i.e. on the host itself instead of requiring the caller to be
> > located inside of a user namespace. This enables such use-cases as e.g.
> > making a usb stick available in multiple locations with different
> > idmappings (see the vfat port that is part of this patch series).
> >
> > The vfsmount struct gains a new struct user_namespace member. The
> > idmapping of the user namespace becomes the idmapping of the mount. A
> > caller that is either privileged with respect to the user namespace of
> > the superblock of the underlying filesystem or a caller that is
> > privileged with respect to the user namespace a mount has been idmapped
> > with can create a new bind-mount and mark it with a user namespace. The
> > user namespace the mount will be marked with can be specified by passing
> > a file descriptor refering to the user namespace as an argument to the
> > new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
> > By default vfsmounts are marked with the initial user namespace and no
> > behavioral or performance changes should be observed. All mapping
> > operations are nops for the initial user namespace.
> >
> > When a file/inode is accessed through an idmapped mount the i_uid and
> > i_gid of the inode will be remapped according to the user namespace the
> > mount has been marked with. When a new object is created based on the
> > fsuid and fsgid of the caller they will similarly be remapped according
> > to the user namespace of the mount they care created from.
> >
> > This means the user namespace of the mount needs to be passed down into
> > a few relevant inode_operations. This mostly includes inode operations
> > that create filesystem objects or change file attributes.
>
> That's really quite ... messy.

I don't agree. It does touch code all across the vfs, but it's not hacky
in any way since it cleanly passes down an additional argument (I'm
biased, of course).

>
> Maybe I'm missing something, but if you have the user_ns to be used
> for the VFS operation we are about to execute then why can't we use
> the same model as current_fsuid/current_fsgid() for passing the
> filesystem credentials down to the filesystem operations? i.e.
> attach it to the current->cred->fs_userns, and then the filesystem
> code that actually needs to know the current userns can call
> current_fs_user_ns() instead of current_user_ns(). i.e.
>
> #define current_fs_user_ns() \
> (current->cred->fs_userns ? current->cred->fs_userns \
> : current->cred->userns)
>
> At this point, the filesystem will now always have the correct
> userns it is supposed to use for mapping the uid/gid, right?

Thanks for this interesting idea! I have some trouble with it though.

This approach (always) seemed conceptually wrong to me. Like Tycho said
somewhere else, this would basically act like a global variable, which
isn't great.

There's also a substantial difference in that the current fsuid and
fsgid are an actual property of the caller's creds, so having them in
there makes perfect sense. But the user namespace of the vfsmount is a
property of the mount, and as such gluing it to the caller's creds when
calling into the vfs is just weird and I would very much like to avoid
it. If inodes didn't have an i_sb member, we wouldn't suddenly start
passing down the s_user_ns via the caller's creds to the filesystems.

I'm also not fond of having to call prepare_creds() and override_creds()
all across the vfs. It's messy, and prepare_creds() is especially
problematic during RCU pathwalk where we can't call it. We could call
prepare_creds() in path_init() at the start of every lookup operation,
override the creds when we need to switch the fs_userns variable, and
then put_cred() at the end of every path walk in terminate_walk(). But
this means penalizing every lookup operation with an additional
prepare_creds() which needs to be called at least once, I think. Then
during lookup we would need to override/change this new fs_userns field
potentially at each mountpoint crossing to switch between the correct
fs_userns for idmapped and non-idmapped mounts. We'd also need to
rearrange a bunch of terminate_walk() calls, and we would likely end up
with a comparable amount of changes, only that they would indeed be
messier since we're strapping fs_userns to the caller's creds.
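
For reference, the override pattern under discussion would look roughly
like this. prepare_creds()/override_creds()/revert_creds()/put_cred()
are the existing APIs; the fs_userns member on struct cred is
hypothetical and does not exist today:

static const struct cred *begin_fs_userns(struct user_namespace *mnt_userns)
{
	struct cred *cred;

	/* may sleep and allocates, hence the RCU pathwalk problem above */
	cred = prepare_creds();
	if (!cred)
		return NULL;
	cred->fs_userns = mnt_userns;	/* hypothetical member */
	return override_creds(cred);	/* returns the creds to restore */
}

static void end_fs_userns(const struct cred *old_cred)
{
	const struct cred *tmp = current_cred();

	revert_creds(old_cred);		/* back to the caller's creds */
	put_cred(tmp);			/* drop the temporary copy */
}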

An alternative might be a combined approach where you pass the user
namespace around in the vfs but use the cred-based fs_userns approach
when calling into the filesystem. But I would very much prefer to avoid
this and instead cleanly pass the user namespace down explicitly. That's
more work and it'll take longer, but I and others are around to see
these changes through.
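
To make the contrast concrete, the explicit approach amounts to id
translation helpers that take the mount's user namespace as an argument
and are nops for the initial user namespace, roughly along these lines
(a sketch, not the patch itself; make_kuid()/from_kuid() are the
existing primitives, the helper names are illustrative):

static inline kuid_t kuid_into_mnt(struct user_namespace *mnt_userns, kuid_t kuid)
{
	/* nop for mounts marked with the initial user namespace */
	if (mnt_userns == &init_user_ns)
		return kuid;
	/* i_uid read through an idmapped mount: interpret the raw id
	 * as an id in the mount's user namespace */
	return make_kuid(mnt_userns, __kuid_val(kuid));
}

static inline kuid_t kuid_from_mnt(struct user_namespace *mnt_userns, kuid_t kuid)
{
	if (mnt_userns == &init_user_ns)
		return kuid;
	/* object creation: map the caller's fsuid into the mount's
	 * user namespace before it hits the filesystem */
	return KUIDT_INIT(from_kuid(mnt_userns, kuid));
}

The relevant inode operations then just grow an explicit
struct user_namespace argument instead of smuggling it through the
caller's creds.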

>
> Also, if we are passing work off to worker threads, duplicating
> the current creds will capture this information and won't leave
> random landmines where stuff doesn't work as it should because the
> worker thread is unaware of the userns that it is supposed to be
> doing filesystem operations under...

That seems like a problem that can be handled by simply keeping the
userns around, similar to how we already need to keep creds around.

Christian

2020-10-29 16:36:23

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 10:12:31AM -0600, Tycho Andersen wrote:
> Hi Eric,
>
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> > Christian Brauner <[email protected]> writes:
> >
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> >
> > Can you walk us through the motivating use case?
> >
> > As of this year's LPC I had the distinct impression that the primary use
> > case for such a feature was due to the RLIMIT_NPROC problem where two
> > containers with the same users still wanted different uid mappings to
> > the disk because the users were conflicting with each other because of
> > the per user rlimits.
> >
> > Fixing rlimits is straight forward to implement, and easier to manage
> > for implementations and administrators.
>
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.
>
> > Reading up on systemd-homed it appears to be a way to have encrypted
> > home directories. Those home directories can either be encrypted at the
> > fs or at the block level. Those home directories appear to have the
> > goal of being luggable between systems. If the systems in question
> > don't have common administration of uids and gids after lugging your
> > encrypted home directory to another system chowning the files is
> > required.
> >
> > Is that the use case you are looking at removing the need for
> > systemd-homed to avoid chowning after lugging encrypted home directories
> > from one system to another? Why would it be desirable to avoid the
> > chown?
>
> Not just systemd-homed, but LXD has to do this, as does our
> application at Cisco, and presumably others.
>
> Several reasons:
>
> * the chown is slow
> * the chown requires somewhere to write the delta in metadata (e.g. an
> overlay workdir, or an LV or something), and there are N copies of
> this delta, one for each container.
> * it means we need to have a +w filesystem at some point during
> execution.
> * it's ugly :). Conceptually, the kernel solves the uid shifting
> problem for us for most other kernel subsystems (including in a
> limited way fscaps) by configuring a user namespace. It feels like
> we should be able to do the same with the VFS.

And chown prevents the same inode from being shared by different
containers through different id mappings. You can overlay, but then
they can't actually share updates.

-serge

2020-10-29 16:37:27

by Sargun Dhillon

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 05:05:02PM +0100, Lennart Poettering wrote:
> On Do, 29.10.20 10:47, Eric W. Biederman ([email protected]) wrote:
>
> > Is that the use case you are looking at removing the need for
> > systemd-homed to avoid chowning after lugging encrypted home directories
> > from one system to another? Why would it be desirable to avoid the
> > chown?
>
> Yes, I am very interested in seeing Christian's work succeed, for the
> usecase in systemd-homed. In systemd-homed each user gets their own
> private file system, and these fs shall be owned by the user's local
> UID, regardless in which system it is used. The UID should be an
> artifact of the local, individual system in this model, and thus
> the UID on of the same user/home on system A might be picked as 1010
> and on another as 1543, and on a third as 1323, and it shouldn't
> matter. This way, home directories become migratable without having to
> universially sync UID assignments: it doesn't matter anymore what the
> local UID is.
>
> Right now we do a recursive chown() at login time to ensure the home
> dir is properly owned. This has two disadvantages:
>
> 1. It's slow. In particular on large home dirs, it takes a while to go
> through the whole user's homedir tree and chown/adjust ACLs for
> everything.
>
> 2. Because it is so slow we take a shortcut right now: if the
> top-level home dir inode itself is owned by the correct user, we
> skip the recursive chowning. This means in the typical case where a
> user uses the same system most of the time, and thus the UID is
> stable we can avoid the slowness. But this comes at a drawback: if
> the user for some reason ends up with files in their homedir owned
> by an unrelated user, then we'll never notice or readjust.
>
> > If the goal is to solve fragmented administration of uid assignment I
> > suggest that it might be better to solve the administration problem so
> > that all of the uids of interest get assigned the same way on all of the
> > systems of interest.
>
> Well, the goal is to make things simple and be able to use the home
> dir everywhere without any prior preparation, without central UID
> assignment authority.
>
> The goal is to have a scheme that requires no administration, by
> making the UID management problem go away. Hence, if you suggest
> solving this by having a central administrative authority: this is
> exactly what the model wants to get away from.
>
> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.
>
> Lennart
>
> --
> Lennart Poettering, Berlin

Can you help me understand systemd-homed a little bit?

In the man page it says:

systemd-homed is a system service that may be used to create, remove, change or
inspect home areas (directories and network mounts and real or loopback block
devices with a filesystem, optionally encrypted).

It seems that the "underlay" (if you'll call it that, maybe there is a
better term) can either be a standalone block device (this sounds close
to systemd-machined?), a btrfs subvolume (which receives its own
superblock, IIRC? I might be wrong, it's been a while since I've used
btrfs), or just a directory that's mapped?

What decides whether it's just a directory and bind-mounted (or a similar
vfsmount), or an actual superblock?

How does the mapping of "real UIDs" to "namespace UIDs" work when it's
just a bind mount? From the perspective of multiple user namespaces, are
all "underlying" UIDs mapped through, or if I try to look at another
user's home directory will they not show up?

Is there a reason you can't / don't / won't use overlayfs instead of bind
mounts?

2020-10-29 16:38:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

Aleksa Sarai <[email protected]> writes:

> On 2020-10-29, Eric W. Biederman <[email protected]> wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>>
>> Can you walk us through the motivating use case?
>>
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>>
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> This is separate to the question of "isolated user namespaces" and
> managing different mappings between containers. This patchset is solving
> the same problem that shiftfs solved -- sharing a single directory tree
> between containers that have different ID mappings. rlimits (nor any of
> the other proposals we discussed at LPC) will help with this problem.

First and foremost: A uid shift on write to a filesystem is a security
bug waiting to happen. This is especially true in the context of
facilities like io_uring, which play very aggressive games with how
process context makes it to system calls.

The only reason containers were not immediately exploitable when
io_uring was introduced is because the mechanisms are built so that even
if something escapes containment the security properties still apply.
Changes to the uid when writing to the filesystem do not have that
property. The tiniest slip in containment will be a security issue.

This is not even the least bit theoretical. I have seen reports of how
shiftfs+overlayfs created a situation where anyone could read
/etc/shadow.

If you are going to write using the same uid to disk from different
containers the question becomes why can't those containers configure
those users to use the same kuid?

What fixing rlimits does is it fixes one of the reasons that different
containers could not share the same kuid for users that want to write to
disk with the same uid.


I humbly suggest that it will be more secure, and easier to maintain for
both developers and users if we fix the reasons people want different
containers to have the same user running with different kuids.

If not what are the reasons we fundamentally need the same on-disk user
using multiple kuids in the kernel?

Eric

2020-10-29 16:47:42

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

Tycho Andersen <[email protected]> writes:

> Hi Eric,
>
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <[email protected]> writes:
>>
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>>
>> Can you walk us through the motivating use case?
>>
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>>
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.

Why do you have disjoint ID mappings for the users that are writing
to disk with the same ID?

>> Reading up on systemd-homed it appears to be a way to have encrypted
>> home directories. Those home directories can either be encrypted at the
>> fs or at the block level. Those home directories appear to have the
>> goal of being luggable between systems. If the systems in question
>> don't have common administration of uids and gids after lugging your
>> encrypted home directory to another system chowning the files is
>> required.
>>
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another? Why would it be desirable to avoid the
>> chown?
>
> Not just systemd-homed, but LXD has to do this,

I asked why the same disk users are assigned different kuids and the
only reason I have heard that LXD does this is the RLIMIT_NPROC problem.

Perhaps there is another reason.

In part this is why I am eager to hear people's use cases, and why I was
trying very hard to make certain we get the requirements.

I want the real requirements though, and some thought, not just "we did
this and it hurts". Changing the uids on write is a very hard problem,
and not just in implementing it but also in maintaining and
understanding what is going on.

Eric

2020-10-29 16:55:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

Lennart Poettering <[email protected]> writes:

> On Do, 29.10.20 10:47, Eric W. Biederman ([email protected]) wrote:
>
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another? Why would it be desirable to avoid the
>> chown?
>
> Yes, I am very interested in seeing Christian's work succeed, for the
> usecase in systemd-homed. In systemd-homed each user gets their own
> private file system, and these fs shall be owned by the user's local
> UID, regardless in which system it is used. The UID should be an
> artifact of the local, individual system in this model, and thus
> the UID on of the same user/home on system A might be picked as 1010
> and on another as 1543, and on a third as 1323, and it shouldn't
> matter. This way, home directories become migratable without having to
> universially sync UID assignments: it doesn't matter anymore what the
> local UID is.
>
> Right now we do a recursive chown() at login time to ensure the home
> dir is properly owned. This has two disadvantages:
>
> 1. It's slow. In particular on large home dirs, it takes a while to go
> through the whole user's homedir tree and chown/adjust ACLs for
> everything.
>
> 2. Because it is so slow we take a shortcut right now: if the
> top-level home dir inode itself is owned by the correct user, we
> skip the recursive chowning. This means in the typical case where a
> user uses the same system most of the time, and thus the UID is
> stable we can avoid the slowness. But this comes at a drawback: if
> the user for some reason ends up with files in their homedir owned
> by an unrelated user, then we'll never notice or readjust.


The classic solution to this problem for removable media is the
uid=XXX and gid=XXX mount options.

I suspect a similar solution can apply here.
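
For concreteness, the sort of thing meant, e.g. for a vfat stick
(device, mountpoint and ids below are placeholders); every file on the
volume then appears owned by the given uid/gid without any per-inode
chown:

#include <sys/mount.h>
#include <stdio.h>

int main(void)
{
	/* all files on the volume show up as uid 1000 / gid 1000 */
	if (mount("/dev/sdb1", "/mnt/stick", "vfat", 0,
		  "uid=1000,gid=1000") < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}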

I don't think you need a solution that requires different kuids
to be able to write to the same filesystem uid.

>> If the goal is to solve fragmented administration of uid assignment I
>> suggest that it might be better to solve the administration problem so
>> that all of the uids of interest get assigned the same way on all of the
>> systems of interest.
>
> Well, the goal is to make things simple and be able to use the home
> dir everywhere without any prior preparation, without central UID
> assignment authority.
>
> The goal is to have a scheme that requires no administration, by
> making the UID management problem go away. Hence, if you suggest
> solving this by having a central administrative authority: this is
> exactly what the model wants to get away from.

For files that can be accessed by more than a single user this is
fundamentally necessary. Otherwise group permissions and ACLs cannot
work. They wind up as meaningless garbage, because without some kind of
synchronization those other users and groups simply cannot be
represented.

> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.

If they are single-user systems, why should you need to?

But if permissions on files are going to be at all meaningful, it is
fundamentally a requirement that there be no confusion about which
party the other parties are talking about. To the best of my knowledge
syncing uids/usernames between machines is as simple as it can get.

Eric

2020-10-29 18:07:13

by Stéphane Graber

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 12:45 PM Eric W. Biederman
<[email protected]> wrote:
>
> Tycho Andersen <[email protected]> writes:
>
> > Hi Eric,
> >
> > On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <[email protected]> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straight forward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > Our use case is to have the same directory exposed to several
> > different containers which each have disjoint ID mappings.
>
> Why do the you have disjoint ID mappings for the users that are writing
> to disk with the same ID?
>
> >> Reading up on systemd-homed it appears to be a way to have encrypted
> >> home directories. Those home directories can either be encrypted at the
> >> fs or at the block level. Those home directories appear to have the
> >> goal of being luggable between systems. If the systems in question
> >> don't have common administration of uids and gids after lugging your
> >> encrypted home directory to another system chowning the files is
> >> required.
> >>
> >> Is that the use case you are looking at removing the need for
> >> systemd-homed to avoid chowning after lugging encrypted home directories
> >> from one system to another? Why would it be desirable to avoid the
> >> chown?
> >
> > Not just systemd-homed, but LXD has to do this,
>
> I asked why the same disk users are assigned different kuids and the
> only reason I have heard that LXD does this is the RLIMIT_NPROC problem.
>
> Perhaps there is another reason.
>
> In part this is why I am eager to hear peoples use case, and why I was
> trying very hard to make certain we get the requirements.
>
> I want the real requirements though and some thought, not just we did
> this and it hurts. Changning the uids on write is a very hard problem,
> and not just in implementating it but also in maintaining and
> understanding what is going on.

The most common cases where shiftfs is used or where folks would like
to use it today are (by importance):
- Fast container creation (by not having to uid/gid shift all files
in the downloaded image)
- Sharing data between the host system and a container (some paths
under /home being the most common)
- Sharing data between unprivileged containers with a disjoint map
- Sharing data between multiple containers, some privileged, some unprivileged

Fixing the ulimit issue only takes care of one of those (the 3rd item);
it does not solve any of the other cases.

The first item on there alone can be quite significant. Creation and
startup of a regular Debian container on my system takes around 500ms
when shiftfs is used (btrfs/lvm/zfs copy-on-write clone of the image,
setup shiftfs, start container) compared to 2-3s when running without
it (same clone, followed by rewrite of all uid/gid present on the fs,
including acls and capabilities, then start container). And that's on
a fast system with an NVME SSD and a small rootfs. We have had reports
of a few users running on slow spinning rust with large containers
where shifting can take several minutes.

The second item can technically be worked around without shifted
bind-mounts by doing userns map hole punching, mapping the user's
uid/gid from the host straight into the container. The downside to
this is that another shifting pass becomes needed for any file outside
of the bind-mounted path (or it would become owned by -1/-1), and it's
very much not dynamic, requiring the container to be stopped, the config
updated by the user, /etc/subuid and subgid maps updated, and the
container started back up. If you need another user/group to be exposed,
start all over again...
This is far more complex, slow and disruptive than the shifted
approach where we just need to do:

  lxc config device add MY-CONTAINER home disk source=/home path=/home shift=true

to inject a new mount of /home from the host into the container with a
shifting layer in place; no need to reconfig subuid/subgid, no need to
re-create the userns to update the mapping and no need to go through
the container's rootfs for any file which may now need remapping
because of the map change.

Stéphane

> Eric

2020-10-29 22:01:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts



> On Oct 28, 2020, at 5:35 PM, Christian Brauner <[email protected]> wrote:
>
> Hey everyone,
>
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
>
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
>
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
>
> The solution proposed here has it's origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this sligthly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
>
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
>
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace.

So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.

For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.

Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this. Inside the namespace, the user creates a suid-root file.

Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over.

So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns.

Hmm. What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns? Is that too limiting?

I hope that whatever solution gets used is straightforward enough to wrap one’s head around.

> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they care created from.

By “mapped according to”, I presume you mean that the on-disk uid/gid is the uid/gid as seen in the user namespace in question.


2020-10-30 02:19:29

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> Aleksa Sarai <[email protected]> writes:
>
> > On 2020-10-29, Eric W. Biederman <[email protected]> wrote:
> >> Christian Brauner <[email protected]> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straight forward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. rlimits (nor any of
> > the other proposals we discussed at LPC) will help with this problem.
>
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen. This is especially in the context of facilities
> like iouring, that play very agressive games with how process context
> makes it to system calls.
>
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem does not have that
> property. The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical. I have seem reports of how
> shitfs+overlayfs created a situation where anyone could read
> /etc/shadow.
>
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?

Because if user 'myapp' in two otherwise isolated containers has the
same kuid in both, so that they can write to a shared directory, then
root in container 1 has privilege over all files owned by 'myapp' in
container 2.

Whereas if they can each have distinct kuids, but when writing to the
shared fs use a shared uid not otherwise belonging to either container,
their rootfses can remain completely off limits to each other.
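
For concreteness, "distinct kuids" here just means each container's
userns maps its ids onto a different host range. A sketch only; the pids
and ranges are placeholders, and uid_map can be written just once, by a
sufficiently privileged process:

#include <stdio.h>

/* a uid_map line is "<id inside ns> <id outside ns> <length>" */
static int write_uid_map(const char *pid, const char *map)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/uid_map", pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(map, f);
	return fclose(f);
}

int main(void)
{
	write_uid_map("1111", "0 100000 65536");	/* container 1 */
	write_uid_map("2222", "0 200000 65536");	/* container 2 */
	return 0;
}

With that, uid 1000 inside each container is kuid 101000 in one and
201000 in the other, which is exactly why a shared directory then needs
either a shared kuid or an ownership translation at the mount.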

> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
>
>
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
>
> If not what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?
>
> Eric

2020-10-30 12:02:53

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
>
>
> > On Oct 28, 2020, at 5:35 PM, Christian Brauner <[email protected]> wrote:
> >
> > Hey everyone,
> >
> > I vanished for a little while to focus on this work here so sorry for
> > not being available by mail for a while.
> >
> > Since quite a long time we have issues with sharing mounts between
> > multiple unprivileged containers with different id mappings, sharing a
> > rootfs between multiple containers with different id mappings, and also
> > sharing regular directories and filesystems between users with different
> > uids and gids. The latter use-cases have become even more important with
> > the availability and adoption of systemd-homed (cf. [1]) to implement
> > portable home directories.
> >
> > The solutions we have tried and proposed so far include the introduction
> > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > call override creds in the vfs. None of these solutions have covered all
> > of the above use-cases.
> >
> > The solution proposed here has it's origins in multiple discussions
> > during Linux Plumbers 2017 during and after the end of the containers
> > microconference.
> > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > James, and myself. A variant of the solution proposed here has also been
> > discussed, again to the best of my knowledge, after a Linux conference
> > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > after Linux Plumbers.
> > I've taken the time to finally implement a working version of this
> > solution over the last weeks to the best of my abilities. Tycho has
> > signed up for this sligthly crazy endeavour as well and he has helped
> > with the conversion of the xattr codepaths.
> >
> > The core idea is to make idmappings a property of struct vfsmount
> > instead of tying it to a process being inside of a user namespace which
> > has been the case for all other proposed approaches.
> > It means that idmappings become a property of bind-mounts, i.e. each
> > bind-mount can have a separate idmapping. This has the obvious advantage
> > that idmapped mounts can be created inside of the initial user
> > namespace, i.e. on the host itself instead of requiring the caller to be
> > located inside of a user namespace. This enables such use-cases as e.g.
> > making a usb stick available in multiple locations with different
> > idmappings (see the vfat port that is part of this patch series).
> >
> > The vfsmount struct gains a new struct user_namespace member. The
> > idmapping of the user namespace becomes the idmapping of the mount. A
> > caller that is either privileged with respect to the user namespace of
> > the superblock of the underlying filesystem or a caller that is
> > privileged with respect to the user namespace a mount has been idmapped
> > with can create a new bind-mount and mark it with a user namespace.
>
> So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
>
> For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
>
> Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this. Inside the namespace, the user creates a suid-root file.
>
> Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and if maps it to the init namespace. Game over.
>
> So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns.

With this series, in order to create an idmapped mount the user must
either be cap_sys_admin in the user namespace of the superblock of the
underlying filesystem or, if the mount is already idmapped and they want
to create another idmapped mount from it, they must have cap_sys_admin
in the userns that the mount is currently marked with. It is also not
possible to change an idmapped mount once it has been idmapped, i.e. the
user must create a new detached bind-mount first.
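
For concreteness, a userspace sketch of that flow: create a detached
bind-mount, mark it with the user namespace via the proposed
mount_setattr() with MOUNT_ATTR_IDMAP and a userns fd, then attach it.
The struct layout, flag and syscall availability here follow the
eventual uapi and are assumptions; the pid and paths are placeholders.

#define _GNU_SOURCE
#include <fcntl.h>		/* AT_FDCWD, AT_EMPTY_PATH, O_* */
#include <linux/mount.h>	/* OPEN_TREE_*, MOVE_MOUNT_*, MOUNT_ATTR_IDMAP */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	struct mount_attr attr = { .attr_set = MOUNT_ATTR_IDMAP };
	int userns_fd, tree_fd;

	/* user namespace whose idmapping the new mount should carry */
	userns_fd = open("/proc/1234/ns/user", O_RDONLY | O_CLOEXEC);
	if (userns_fd < 0) {
		perror("open userns");
		return 1;
	}

	/* new detached bind-mount of the directory to be shared */
	tree_fd = syscall(SYS_open_tree, AT_FDCWD, "/srv/data",
			  OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (tree_fd < 0) {
		perror("open_tree");
		return 1;
	}

	/* mark the detached mount with the user namespace ... */
	attr.userns_fd = userns_fd;
	if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0) {
		perror("mount_setattr");
		return 1;
	}

	/* ... and attach the now-idmapped mount at its destination */
	if (syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, "/mnt/idmapped",
		    MOVE_MOUNT_F_EMPTY_PATH) < 0) {
		perror("move_mount");
		return 1;
	}
	return 0;
}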

>
> Hmm. What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns? Is that too limiting?
>
> I hope that whatever solution gets used is straightforward enough to wrap one’s head around.
>
> > When a file/inode is accessed through an idmapped mount the i_uid and
> > i_gid of the inode will be remapped according to the user namespace the
> > mount has been marked with. When a new object is created based on the
> > fsuid and fsgid of the caller they will similarly be remapped according
> > to the user namespace of the mount they care created from.
>
> By “mapped according to”, I presume you mean that the on-disk uid/gid is the gid as seen in the user namespace in question.

If I understand you correctly, then yes.

2020-10-30 15:10:45

by Seth Forshee

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen. This is especially in the context of facilities
> like iouring, that play very agressive games with how process context
> makes it to system calls.
>
> The only reason containers were not immediately exploitable when iouring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem does not have that
> property. The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical. I have seem reports of how
> shitfs+overlayfs created a situation where anyone could read
> /etc/shadow.

This bug was the result of a complex interaction with several
contributing factors. It's fair to say that one component was overlayfs
writing through an id-shifted mount, but the primary cause was related
to how copy-up was done coupled with allowing unprivileged overlayfs
mounts in a user ns. Checks that the mounter had access to the lower fs
file were not done before copying data up, and so the file was copied up
temporarily to the id shifted upperdir. Even though it was immediately
removed, other factors made it possible for the user to get the file
contents from the upperdir.

Regardless, I do think you raise a good point. We need to be wary of any
place the kernel could open files through a shifted mount, especially
when the open could be influenced by userspace.

Perhaps kernel file opens through shifted mounts should be opt-in.
I.e. unless a flag is passed, or a different open interface used, the
open will fail if the dentry being opened is subject to id shifting.
This way any kernel writes which would be subject to id shifting will
only happen through code which has been written to take it into account.
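
Something along these lines, as a sketch only: the flag and the helper
are made up here, and it assumes the vfsmount carries the mount's user
namespace the way this series proposes.

#define KERNEL_OPEN_ALLOW_IDMAPPED	0x1	/* hypothetical opt-in flag */

static int kernel_open_permitted(const struct path *path, unsigned int flags)
{
	/* refuse kernel-internal opens through an id-shifted mount unless
	 * the calling code explicitly opted in */
	if (path->mnt->mnt_userns != &init_user_ns &&
	    !(flags & KERNEL_OPEN_ALLOW_IDMAPPED))
		return -EACCES;
	return 0;
}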

Seth

2020-10-30 16:04:17

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Fri, Oct 30, 2020 at 10:07:48AM -0500, Seth Forshee wrote:
> On Thu, Oct 29, 2020 at 11:37:23AM -0500, Eric W. Biederman wrote:
> > First and foremost: A uid shift on write to a filesystem is a security
> > bug waiting to happen. This is especially in the context of facilities
> > like iouring, that play very agressive games with how process context
> > makes it to system calls.
> >
> > The only reason containers were not immediately exploitable when iouring
> > was introduced is because the mechanisms are built so that even if
> > something escapes containment the security properties still apply.
> > Changes to the uid when writing to the filesystem does not have that
> > property. The tiniest slip in containment will be a security issue.
> >
> > This is not even the least bit theoretical. I have seem reports of how
> > shitfs+overlayfs created a situation where anyone could read
> > /etc/shadow.
>
> This bug was the result of a complex interaction with several
> contributing factors. It's fair to say that one component was overlayfs
> writing through an id-shifted mount, but the primary cause was related
> to how copy-up was done coupled with allowing unprivileged overlayfs
> mounts in a user ns. Checks that the mounter had access to the lower fs
> file were not done before copying data up, and so the file was copied up
> temporarily to the id shifted upperdir. Even though it was immediately
> removed, other factors made it possible for the user to get the file
> contents from the upperdir.
>
> Regardless, I do think you raise a good point. We need to be wary of any
> place the kernel could open files through a shifted mount, especially
> when the open could be influenced by userspace.
>
> Perhaps kernel file opens through shifted mounts should to be opt-in.
> I.e. unless a flag is passed, or a different open interface used, the
> open will fail if the dentry being opened is subject to id shifting.
> This way any kernel writes which would be subject to id shifting will
> only happen through code which as been written to take it into account.

For my use cases, it would be fine to require opt-in at original fs
mount time by the init_user_ns admin, i.e.

  mount -o allow_idmap /dev/mapper/whoozit /whatzit

I'm quite certain I would always be sharing a separate LV or loopback or
tmpfs.

-serge

2020-10-30 16:17:50

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Fri, Oct 30, 2020 at 01:01:57PM +0100, Christian Brauner wrote:
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <[email protected]> wrote:
> > >
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > >
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > >
> > > The solution proposed here has it's origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this sligthly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > >
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > >
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> >
> > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> >
> > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> >
> > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this. Inside the namespace, the user creates a suid-root file.
> >
> > Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and if maps it to the init namespace. Game over.
> >
> > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns.
>
> With this series, in order to create an idmapped mount the user must
> either be cap_sys_admin in the superblock of the underlying filesystem
> or if the mount is already idmapped and they want to create another
> idmapped mount from it they must have cap_sys_admin in the userns that
> the mount is currrently marked with. It is also not possible to change
> an idmapped mount once it has been idmapped, i.e. the user must create a
> new detached bind-mount first.

Yeah, I spent quite some time last night trying to figure out the scenario
you were presenting, but I failed. Andy, could you either rephrase or
give a more concrete end-to-end attack scenario?

2020-10-31 17:44:32

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Fri, Oct 30, 2020 at 5:02 AM Christian Brauner
<[email protected]> wrote:
>
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner <[email protected]> wrote:
> > >
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > >
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > >
> > > The solution proposed here has it's origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this sligthly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > >
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > >
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> >
> > So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside.
> >
> > For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege.
> >
> > Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this. Inside the namespace, the user creates a suid-root file.
> >
> > Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.) So the user makes a new bind mount and idmaps it to the init namespace. Game over.
> >
> > So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns.
>
> With this series, in order to create an idmapped mount the user must
> either be cap_sys_admin in the user namespace of the superblock of the
> underlying filesystem or, if the mount is already idmapped and they want
> to create another idmapped mount from it, have cap_sys_admin in the
> userns that the mount is currently marked with. It is also not possible
> to change an idmapped mount once it has been idmapped, i.e. the user
> must create a new detached bind-mount first.

I think my attack might not work, but I also think I didn't explain it
very well. Let me try again. I'll also try to lay out what I
understand the rules of idmaps to be so that you can correct me when
I'm inevitably wrong :)

First, background: there are a bunch of user namespaces around. Every
superblock has one, every idmapped mount has one, and every vfsmnt
also (indirectly) has one: mnt->mnt_ns->user_ns. So, if you're
looking at a given vfsmnt, you have three user namespaces that are
relevant, in addition to whatever namespaces are active for the task
(or kernel thread) accessing that mount. I'm wondering whether
mnt_user_ns() should possibly have a name that makes it clear that it
refers to the idmap namespace and not mnt->mnt_ns->user_ns.
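
To keep those three apart, here is a rough kernel-internal sketch of where
each namespace hangs off a mount. Member and helper names follow this
series and fs/namespace.c; treat the snippet as illustrative rather than
something that compiles in isolation.

/* The three user namespaces relevant when looking at a vfsmount @mnt: */

/* 1. Owner of the filesystem instance, fixed at superblock creation. */
struct user_namespace *sb_userns  = mnt->mnt_sb->s_user_ns;

/* 2. The idmapping of this particular bind-mount -- what this series adds. */
struct user_namespace *map_userns = mnt_user_ns(mnt);

/* 3. Owner of the mount namespace the mount is attached to
 *    ("mnt->mnt_ns->user_ns" above; mnt_ns lives in struct mount). */
struct user_namespace *mnt_ns_userns = real_mount(mnt)->mnt_ns->user_ns;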

So here's the attack. An attacker with uid=1000 creates a userns N
(so the attacker owns the ns and 1000 outside maps to 0 inside). N is
a child of init_user_ns. Now the attacker creates a mount namespace M
inside the userns and, potentially with the help of a container
management tool, creates an idmapped filesystem mount F inside M. So,
playing fast and loose with my ampersands:

F->mnt_ns == M
F->mnt_ns->user_ns == N
mnt_user_ns(F) == N

I expect that this wouldn't be a particularly uncommon setup. Now the
user has the ability to create files with inode->uid == 0 and the SUID
bit set on their filesystem. This isn't terribly different from FUSE,
except that the mount won't have nosuid set, whereas at least many
uses of unprivileged FUSE would have nosuid set. So that's the thing that
makes me a little bit nervous. But it actually seems likely that I
was wrong and this is okay. Specifically, to exploit this using
kernel mechanisms, one would need to pass a mnt_may_suid() check,
which means that one would need to acquire a mount of F in one's
current mount namespace, and one would need one's current user
namespace to be init_ns (or something else sensitive). But you
already need to own the namespace to create mounts, unless you have a
way to confuse some existing user tooling. You would also need to be
in F's superblock's user_ns (second line of mnt_may_suid()), which
totally kills this type of attack if F's superblock is in the
container's user_ns, but I wouldn't count on that.
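
For reference, the check being referred to is roughly the following,
paraphrased from fs/namespace.c with comments added; not the verbatim
source.

bool mnt_may_suid(struct vfsmount *mnt)
{
	return !(mnt->mnt_flags & MNT_NOSUID) &&          /* mount not marked nosuid    */
	       check_mnt(real_mount(mnt)) &&              /* mount is in caller's mntns */
	       current_in_userns(mnt->mnt_sb->s_user_ns); /* caller is in sb's userns   */
}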

So maybe this is all fine. I'll continue to try to poke holes in it,
but perhaps there aren't any holes to poke. I'll also continue to try
to see if I can state the security properties of idmap in a way that
is clear and obviously has nice properties.

Why are you allowing the creation of a new idmapped mount if you have
cap_sys_admin over an existing idmap userns but not over the
superblock's userns? I assume this is for a nested container use
case, but can you spell out a specific example usage?

--Andy

2020-11-01 14:47:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 06/34] fs: add id translation helpers

> +static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
> +{
> +#ifdef CONFIG_IDMAP_MOUNTS
> + return make_kuid(to, __kuid_val(kuid));
> +#else
> + return kuid;
> +#endif
> +}
> +
> +static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
> +{
> +#ifdef CONFIG_IDMAP_MOUNTS
> + return make_kgid(to, __kgid_val(kgid));
> +#else
> + return kgid;
> +#endif

If you want to keep the config option, please at least have one
#ifdef/#else/#endif instead of this mess.

2020-11-02 13:25:52

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH 06/34] fs: add id translation helpers

On Sun, Nov 01, 2020 at 02:46:32PM +0000, Christoph Hellwig wrote:
> > +static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
> > +{
> > +#ifdef CONFIG_IDMAP_MOUNTS
> > + return make_kuid(to, __kuid_val(kuid));
> > +#else
> > + return kuid;
> > +#endif
> > +}
> > +
> > +static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
> > +{
> > +#ifdef CONFIG_IDMAP_MOUNTS
> > + return make_kgid(to, __kgid_val(kgid));
> > +#else
> > + return kgid;
> > +#endif
>
> If you want to keep the config option, please at least have one
> #ifdef/#else/#endif instead of this mess.

Understood.
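
For illustration, the single-#ifdef layout being asked for would look
something like this, using the same two helpers from the quoted hunk (a
sketch only; a later revision of the series may arrange it differently):

#ifdef CONFIG_IDMAP_MOUNTS
static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
{
	return make_kuid(to, __kuid_val(kuid));
}

static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
{
	return make_kgid(to, __kgid_val(kgid));
}
#else /* !CONFIG_IDMAP_MOUNTS */
static inline kuid_t kuid_into_mnt(struct user_namespace *to, kuid_t kuid)
{
	return kuid;
}

static inline kgid_t kgid_into_mnt(struct user_namespace *to, kgid_t kgid)
{
	return kgid;
}
#endif /* CONFIG_IDMAP_MOUNTS */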

2020-11-03 14:12:17

by Alban Crequy

[permalink] [raw]
Subject: Re: [PATCH 00/34] fs: idmapped mounts

On Thu, Oct 29, 2020 at 5:37 PM Eric W. Biederman <[email protected]> wrote:
>
> Aleksa Sarai <[email protected]> writes:
>
> > On 2020-10-29, Eric W. Biederman <[email protected]> wrote:
> >> Christian Brauner <[email protected]> writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > Since quite a long time we have issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other due to
> >> the per-user rlimits.
> >>
> >> Fixing rlimits is straightforward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. Neither rlimits nor
> > any of the other proposals we discussed at LPC will help with this problem.
>
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen. This is especially true in the context of facilities
> like io_uring, which play very aggressive games with how process context
> makes it to system calls.
>
> The only reason containers were not immediately exploitable when io_uring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem does not have that
> property. The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical. I have seen reports of how
> shiftfs+overlayfs created a situation where anyone could read
> /etc/shadow.
>
> If you are going to write to disk using the same uid from different
> containers, the question becomes: why can't those containers configure
> those users to use the same kuid?
>
> What fixing rlimits does is remove one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
>
>
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
>
> If not, what are the reasons we fundamentally need the same on-disk user
> to use multiple kuids in the kernel?

I would like to use this patch set in the context of Kubernetes. I
described my two possible setups in
https://www.spinics.net/lists/linux-containers/msg36537.html:

1. Each Kubernetes pod has its own userns but with the same user id mapping
2. Each Kubernetes pod has its own userns with non-overlapping user id
mapping (providing additional isolation between pods)

But even in the setup where all pods run with the same id mappings,
this patch set is still useful to me for 2 reasons:

1. To avoid the expensive recursive chown of the rootfs. We cannot
necessarily extract the tarball directly with the right uids because
we might use the same container image for privileged containers (with
the host userns) and unprivileged containers (with a new userns), so
we have at least 2 “mappings” (taking more time and resulting in more
storage space). Although the “metacopy” mount option in overlayfs
helps to make the recursive chown faster, it can still take time with
large container images with lots of files. I’d like to use this patch
set to set up the root fs in constant time.

2. To manage large external volumes (NFS or other filesystems). Even
if admins can decide to use the same kuid on all the nodes of the
Kubernetes cluster, this is impractical for migration. People can have
existing Kubernetes clusters (currently without using user namespaces)
and large NFS volumes. If they want to switch to a new version of
Kubernetes with the user namespace feature enabled, they would need to
recursively chown all the files on the NFS shares. This could take
time on large filesystems, and realistically we want to support
rolling updates where some nodes use the previous version without user
namespaces and new nodes are progressively migrated to the new userns
with the new id mapping. If both sets of nodes use the same NFS share,
that can’t work.

Alban