Hey everyone,
This is v3. After (off- and online) discussions with Jann, the following
changes were made:
- To handle nested user namespaces cleanly, efficiently, and with full
backwards compatibility for non fsid-mapping aware workloads we only
allow writing fsid mappings as long as the corresponding id mapping
type has not been written.
- Split the patch which adds the internal ability in
kernel/user_namespace to verify and write fsid mappings into three
patches:
1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
patch to implement core helpers for fsid translations (i.e.
make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
k*id_to_kfs*id())
2. [PATCH v3 05/25] user_namespace: refactor map_write()
patch to refactor map_write() in order to prepare for actual fsid
mappings changes in the following patch. (This should make it
easier to review.)
3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
patch to implement actual fsid mappings support in map_write()
- Let the keyctl infrastructure only operate on kfsids, which are always
mapped/looked up in the id mappings, similar to what we do for
filesystems that have the same superblock visible in multiple user
namespaces.
This version also comes with minimal tests which I intend to expand in
the future.
From pings and off-list questions and discussions at Google Container
Security Summit there seems to be quite a lot of interest in this
patchset with use-cases ranging from layer sharing for app containers
and k8s, as well as data sharing between containers with different id
mappings. I haven't Cced everyone because I don't have all the email
addresses at hand, but I've at least added Phil now. :)
This is the implementation of shiftfs which was cooked up during lunch at
Linux Plumbers 2019 the day after the containers microconference. The
idea is a design-stew from Stéphane, Aleksa, Eric, and myself (and by
now also Jann).
Back then we all were quite busy with other work and couldn't really sit
down and implement it. But I took a few days last week to do this work,
including demos and performance testing.
This implementation does not require us to touch the VFS substantially
at all. Instead, we implement shiftfs via fsid mappings.
With this patch, it took me 20 mins to port both LXD and LXC to support
shiftfs via fsid mappings.
For anyone wanting to play with this the branch can be pulled from:
https://github.com/brauner/linux/tree/fsid_mappings
https://gitlab.com/brauner/linux/-/tree/fsid_mappings
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
The main use case for shiftfs for us is in allowing shared writable
storage to multiple containers using non-overlapping id mappings.
In such a scenario you want the fsids to be valid and identical in both
containers for the shared mount. A demo for this exists in [3].
If you don't want to read on, go straight to the other demos below in
[1] and [2].
People not as familiar with user namespaces might not be aware that fsid
mappings already exist. Right now, fsid mappings are always identical to
id mappings. Specifically, the kernel will look up fsuids in the uid
mappings and fsgids in the gid mappings of the relevant user namespace.
With this patch series we simply introduce the ability to create fsid
mappings that are different from the id mappings of a user namespace.
The whole feature set is placed under a config option that defaults to
false.
In the usual case of running an unprivileged container we will have
set up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
correspond to this id mapping, i.e. all files which we want to appear as
0:0 inside the user namespace will be chowned to 100000:100000 on the
host. This works because whenever the kernel needs to do a filesystem
access it will look up the corresponding uid and gid in the id mapping
tables of the container.
Now think about the case where we want to have an id mapping of 0 100000
100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
share a single on-disk mapping with multiple containers that all have
different id mappings.
This will be problematic. Whenever a filesystem access is requested, the
kernel will now try to look up a mapping for 300000 in the id mapping
tables of the user namespace, but since there is none the files will
appear to be owned by the overflow id, i.e. usually 65534:65534 or
nobody:nogroup.
With fsid mappings we can solve this by writing an id mapping of 0
100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
access the kernel will now look up the mapping for 300000 in the fsid
mapping tables of the user namespace. And since such a mapping exists,
the corresponding files will have correct ownership.
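To make the lookup concrete, here is a minimal user-space sketch of the
translation described above. It is not the kernel implementation: struct
extent, map_down(), map_up() and owner_seen_in_ns() are hypothetical names
chosen for illustration, and 65534 is assumed as the overflow id. Each
extent corresponds to one "<first> <lower_first> <count>" line of a
mapping file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_ID 65534u     /* assumed default overflow uid/gid */
#define INVALID_ID  UINT32_MAX /* stands in for an invalid kernel id */

/* One mapping line: "<first> <lower_first> <count>". */
struct extent {
	uint32_t first;       /* first id inside the user namespace */
	uint32_t lower_first; /* first id outside (host / on-disk) */
	uint32_t count;       /* length of the range */
};

/* Namespace id -> outside id; INVALID_ID if no extent covers it. */
static uint32_t map_down(const struct extent *map, size_t n, uint32_t id)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return map[i].lower_first + (id - map[i].first);
	return INVALID_ID;
}

/* Outside id -> namespace id; INVALID_ID if no extent covers it. */
static uint32_t map_up(const struct extent *map, size_t n, uint32_t id)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (id >= map[i].lower_first &&
		    id - map[i].lower_first < map[i].count)
			return map[i].first + (id - map[i].lower_first);
	return INVALID_ID;
}

/* Owner a process in the namespace would see for an on-disk id. */
static uint32_t owner_seen_in_ns(const struct extent *map, size_t n,
				 uint32_t disk_id)
{
	uint32_t id = map_up(map, n, disk_id);

	return id == INVALID_ID ? OVERFLOW_ID : id;
}
```

With only the id mapping 0 100000 100000, owner_seen_in_ns() reports 65534
for on-disk id 300000; looked up in the fsid mapping 0 300000 100000 the
same file is reported as owned by 0.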
A note on proc (and sys): the proc filesystem is special insofar as it
only has a single superblock that is (currently, though this might be
about to change) visible in all user namespaces (same goes for sys). This
means it has special semantics in many ways, including how file ownership
and access works. The fsid mapping implementation does not alter how proc
(and sys) ownership works. proc and sys will both continue to look up
filesystem access in id mapping tables.
When writing fsid mappings the same rules apply as when writing id
mappings so I won't reiterate them here. The limit of fsid mappings is
the same as for id mappings, i.e. 340 lines.
# Performance
Back when I extended the range of possible id mappings to 340 I did
performance testing by booting into single user mode, creating 1,000,000
files, fstat()ing them, and calculating the mean fstat() time per file.
(Back when Linux was still fast. I won't mention that the stat
numbers have (thanks microcode!) doubled since then...)
I did the same test for this patchset: one vanilla kernel, one kernel
with my fsid mapping patches but CONFIG_USER_NS_FSID set to n and one
with fsid mappings patches enabled. I then ran the same test on all
three kernels and compared the numbers. The implementation does not
introduce overhead. That's all I can say. Here are the numbers:
             | vanilla v5.5 | fsid mappings       | fsid mappings      | fsid mappings      |
             |              | disabled in Kconfig | enabled in Kconfig | enabled in Kconfig |
             |              |                     | and unset for all  | and set for all    |
             |              |                     | test cases         | test cases         |
-------------|--------------|---------------------|--------------------|--------------------|
0 mappings   | 367 ns       | 365 ns              | 365 ns             | N/A                |
1 mappings   | 362 ns       | 367 ns              | 363 ns             | 363 ns             |
2 mappings   | 361 ns       | 369 ns              | 363 ns             | 364 ns             |
3 mappings   | 361 ns       | 368 ns              | 366 ns             | 365 ns             |
5 mappings   | 365 ns       | 368 ns              | 363 ns             | 365 ns             |
10 mappings  | 391 ns       | 388 ns              | 387 ns             | 389 ns             |
50 mappings  | 395 ns       | 398 ns              | 401 ns             | 397 ns             |
100 mappings | 400 ns       | 405 ns              | 399 ns             | 399 ns             |
200 mappings | 404 ns       | 407 ns              | 430 ns             | 404 ns             |
300 mappings | 492 ns       | 494 ns              | 432 ns             | 413 ns             |
340 mappings | 495 ns       | 497 ns              | 500 ns             | 484 ns             |
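For readers who want to reproduce this kind of measurement, the following
is a rough sketch of timing fstat() and averaging per file. It is not the
harness used for the numbers above: mean_fstat_ns() and the /tmp template
are assumptions, it does not set up any user namespace or mappings, and the
clock_gettime() calls add their own overhead to every sample.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Difference between two timestamps in nanoseconds. */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
	return (double)(b->tv_sec - a->tv_sec) * 1e9 +
	       (double)(b->tv_nsec - a->tv_nsec);
}

/*
 * Create nfiles temporary files one after another, time a single fstat()
 * on each, and return the mean in nanoseconds (or -1.0 on error).
 */
static double mean_fstat_ns(size_t nfiles)
{
	double total = 0.0;

	assert(nfiles > 0);

	for (size_t i = 0; i < nfiles; i++) {
		char path[] = "/tmp/fstat_bench_XXXXXX";
		struct timespec start, end;
		struct stat st;
		int fd, ret;

		fd = mkstemp(path);
		if (fd < 0)
			return -1.0;
		unlink(path); /* the open fd keeps the inode alive */

		clock_gettime(CLOCK_MONOTONIC, &start);
		ret = fstat(fd, &st);
		clock_gettime(CLOCK_MONOTONIC, &end);

		close(fd);
		if (ret < 0)
			return -1.0;

		total += elapsed_ns(&start, &end);
	}

	return total / (double)nfiles;
}
```

To compare configurations one would run mean_fstat_ns(1000000) inside a
user namespace with the desired number of mapping lines on each kernel.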
# Demos
[1]: Create a container with different id and fsid mappings.
https://asciinema.org/a/300233
[2]: Create a container with id mappings but without fsid mappings.
https://asciinema.org/a/300234
[3]: Share storage between multiple containers with non-overlapping id
mappings.
https://asciinema.org/a/300235
Thanks!
Christian
Christian Brauner (25):
user_namespace: introduce fsid mappings infrastructure
proc: add /proc/<pid>/fsuid_map
proc: add /proc/<pid>/fsgid_map
fsuidgid: add fsid mapping helpers
user_namespace: refactor map_write()
user_namespace: make map_write() support fsid mappings
proc: task_state(): use from_kfs{g,u}id_munged
cred: add kfs{g,u}id
fs: add is_userns_visible() helper
namei: may_{o_}create(): handle fsid mappings
inode: inode_owner_or_capable(): handle fsid mappings
capability: privileged_wrt_inode_uidgid(): handle fsid mappings
stat: handle fsid mappings
open: handle fsid mappings
posix_acl: handle fsid mappings
attr: notify_change(): handle fsid mappings
commoncap: cap_bprm_set_creds(): handle fsid mappings
commoncap: cap_task_fix_setuid(): handle fsid mappings
commoncap: handle fsid mappings with vfs caps
exec: bprm_fill_uid(): handle fsid mappings
ptrace: adapt ptrace_may_access() to always uses unmapped fsids
devpts: handle fsid mappings
keys: handle fsid mappings
sys: handle fsid mappings in set*id() calls
selftests: add simple fsid mapping selftests
fs/attr.c | 23 +-
fs/devpts/inode.c | 7 +-
fs/exec.c | 25 +-
fs/inode.c | 7 +-
fs/namei.c | 36 +-
fs/open.c | 16 +-
fs/posix_acl.c | 17 +-
fs/proc/array.c | 5 +-
fs/proc/base.c | 34 ++
fs/stat.c | 48 +-
include/linux/cred.h | 4 +
include/linux/fs.h | 5 +
include/linux/fsuidgid.h | 122 +++++
include/linux/stat.h | 1 +
include/linux/user_namespace.h | 10 +
init/Kconfig | 11 +
kernel/capability.c | 10 +-
kernel/ptrace.c | 4 +-
kernel/sys.c | 106 +++-
kernel/user.c | 22 +
kernel/user_namespace.c | 517 ++++++++++++++++--
security/commoncap.c | 35 +-
security/keys/key.c | 2 +-
security/keys/permission.c | 4 +-
security/keys/process_keys.c | 6 +-
security/keys/request_key.c | 10 +-
security/keys/request_key_auth.c | 2 +-
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/user_namespace/Makefile | 11 +
.../selftests/user_namespace/test_fsid_map.c | 511 +++++++++++++++++
30 files changed, 1461 insertions(+), 151 deletions(-)
create mode 100644 include/linux/fsuidgid.h
create mode 100644 tools/testing/selftests/user_namespace/Makefile
create mode 100644 tools/testing/selftests/user_namespace/test_fsid_map.c
base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
--
2.25.0
- Verify that fsid mappings cannot be written when id mappings have been
written already.
- Set up an id mapping and an fsid mapping, create a file and compare ids in
child and parent user namespace.
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
patch not present
/* v3 */
patch added
---
tools/testing/selftests/Makefile | 1 +
.../testing/selftests/user_namespace/Makefile | 11 +
.../selftests/user_namespace/test_fsid_map.c | 511 ++++++++++++++++++
3 files changed, 523 insertions(+)
create mode 100644 tools/testing/selftests/user_namespace/Makefile
create mode 100644 tools/testing/selftests/user_namespace/test_fsid_map.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 63430e2664c2..49dcd21d2be7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -60,6 +60,7 @@ endif
TARGETS += tmpfs
TARGETS += tpm2
TARGETS += user
+TARGETS += user_namespace
TARGETS += vm
TARGETS += x86
TARGETS += zram
diff --git a/tools/testing/selftests/user_namespace/Makefile b/tools/testing/selftests/user_namespace/Makefile
new file mode 100644
index 000000000000..3f89896f3285
--- /dev/null
+++ b/tools/testing/selftests/user_namespace/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -Wall
+
+all:
+
+TEST_GEN_PROGS += test_fsid_map
+
+include ../lib.mk
+
+$(OUTPUT)/test_fsid_map: test_fsid_map.c ../clone3/clone3_selftests.h
+
diff --git a/tools/testing/selftests/user_namespace/test_fsid_map.c b/tools/testing/selftests/user_namespace/test_fsid_map.c
new file mode 100644
index 000000000000..e278f137ff55
--- /dev/null
+++ b/tools/testing/selftests/user_namespace/test_fsid_map.c
@@ -0,0 +1,511 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <grp.h>
+#include <inttypes.h>
+#include <libgen.h>
+#include <stdbool.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/sched.h>
+#include <sys/fsuid.h>
+#include <sys/mount.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "../clone3/clone3_selftests.h"
+
+static int wait_for_pid(pid_t pid)
+{
+ int status, ret;
+
+again:
+ ret = waitpid(pid, &status, 0);
+ if (ret == -1) {
+ if (errno == EINTR)
+ goto again;
+
+ return -1;
+ }
+
+ if (!WIFEXITED(status))
+ return -1;
+
+ return WEXITSTATUS(status);
+}
+
+static int setid_userns_root(void)
+{
+ if (setuid(0))
+ return -1;
+ if (setgid(0))
+ return -1;
+
+ setfsuid(0);
+ setfsgid(0);
+
+ if (setfsuid(0))
+ return -1;
+
+ if (setfsgid(0))
+ return -1;
+
+ return 0;
+}
+
+enum idmap_type {
+ UID_MAP,
+ GID_MAP,
+ FSUID_MAP,
+ FSGID_MAP,
+};
+
+static ssize_t read_nointr(int fd, void *buf, size_t count)
+{
+ ssize_t ret;
+again:
+ ret = read(fd, buf, count);
+ if (ret < 0 && errno == EINTR)
+ goto again;
+
+ return ret;
+}
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+ ssize_t ret;
+again:
+ ret = write(fd, buf, count);
+ if (ret < 0 && errno == EINTR)
+ goto again;
+
+ return ret;
+}
+
+static int write_id_mapping(enum idmap_type type, pid_t pid, const char *buf,
+ size_t buf_size)
+{
+ int fd;
+ int ret;
+ char path[4096];
+
+ switch (type) {
+ case UID_MAP:
+ ret = snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
+ break;
+ case GID_MAP:
+ ret = snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
+ break;
+ case FSUID_MAP:
+ ret = snprintf(path, sizeof(path), "/proc/%d/fsuid_map", pid);
+ break;
+ case FSGID_MAP:
+ ret = snprintf(path, sizeof(path), "/proc/%d/fsgid_map", pid);
+ break;
+ default:
+ return -1;
+ }
+ if (ret < 0 || ret >= sizeof(path))
+ return -E2BIG;
+
+ fd = open(path, O_WRONLY);
+ if (fd < 0)
+ return -1;
+
+ ret = write_nointr(fd, buf, buf_size);
+ close(fd);
+ if (ret != buf_size)
+ return -1;
+
+ return 0;
+}
+
+const char id_map[] = "0 100000 100000";
+#define id_map_size (sizeof(id_map) - 1)
+
+const char fsid_map[] = "0 300000 100000";
+#define fsid_map_size (sizeof(fsid_map) - 1)
+
+int unix_send_fds_iov(int fd, int *sendfds, int num_sendfds, struct iovec *iov,
+ size_t iovlen)
+{
+ char *cmsgbuf = NULL;
+ int ret;
+ struct msghdr msg;
+ struct cmsghdr *cmsg = NULL;
+ size_t cmsgbufsize = CMSG_SPACE(num_sendfds * sizeof(int));
+
+ memset(&msg, 0, sizeof(msg));
+
+ cmsgbuf = malloc(cmsgbufsize);
+ if (!cmsgbuf) {
+ errno = ENOMEM;
+ return -1;
+ }
+
+ msg.msg_control = cmsgbuf;
+ msg.msg_controllen = cmsgbufsize;
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ cmsg->cmsg_len = CMSG_LEN(num_sendfds * sizeof(int));
+
+ msg.msg_controllen = cmsg->cmsg_len;
+
+ memcpy(CMSG_DATA(cmsg), sendfds, num_sendfds * sizeof(int));
+
+ msg.msg_iov = iov;
+ msg.msg_iovlen = iovlen;
+
+again:
+ ret = sendmsg(fd, &msg, MSG_NOSIGNAL);
+ if (ret < 0)
+ if (errno == EINTR)
+ goto again;
+
+ free(cmsgbuf);
+ return ret;
+}
+
+static int unix_send_fds(int fd, int *sendfds, int num_sendfds, void *data,
+ size_t size)
+{
+ char buf[1] = {0};
+ struct iovec iov = {
+ .iov_base = data ? data : buf,
+ .iov_len = data ? size : sizeof(buf),
+ };
+ return unix_send_fds_iov(fd, sendfds, num_sendfds, &iov, 1);
+}
+
+static int unix_recv_fds_iov(int fd, int *recvfds, int num_recvfds,
+ struct iovec *iov, size_t iovlen)
+{
+ char *cmsgbuf = NULL;
+ int ret;
+ struct msghdr msg;
+ struct cmsghdr *cmsg = NULL;
+ size_t cmsgbufsize = CMSG_SPACE(sizeof(struct ucred)) +
+ CMSG_SPACE(num_recvfds * sizeof(int));
+
+ memset(&msg, 0, sizeof(msg));
+
+ cmsgbuf = malloc(cmsgbufsize);
+ if (!cmsgbuf) {
+ errno = ENOMEM;
+ return -1;
+ }
+
+ msg.msg_control = cmsgbuf;
+ msg.msg_controllen = cmsgbufsize;
+
+ msg.msg_iov = iov;
+ msg.msg_iovlen = iovlen;
+
+again:
+ ret = recvmsg(fd, &msg, 0);
+ if (ret < 0) {
+ if (errno == EINTR)
+ goto again;
+
+ goto out;
+ }
+ if (ret == 0)
+ goto out;
+
+ /*
+ * If SO_PASSCRED is set we will always get a ucred message.
+ */
+ for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+ if (cmsg->cmsg_type != SCM_RIGHTS)
+ continue;
+
+ memset(recvfds, -1, num_recvfds * sizeof(int));
+ if (cmsg &&
+ cmsg->cmsg_len == CMSG_LEN(num_recvfds * sizeof(int)) &&
+ cmsg->cmsg_level == SOL_SOCKET)
+ memcpy(recvfds, CMSG_DATA(cmsg), num_recvfds * sizeof(int));
+ break;
+ }
+
+out:
+ free(cmsgbuf);
+ return ret;
+}
+
+static int unix_recv_fds(int fd, int *recvfds, int num_recvfds, void *data,
+ size_t size)
+{
+ char buf[1] = {0};
+ struct iovec iov = {
+ .iov_base = data ? data : buf,
+ .iov_len = data ? size : sizeof(buf),
+ };
+ return unix_recv_fds_iov(fd, recvfds, num_recvfds, &iov, 1);
+}
+
+static bool has_expected_owner(int fd, uid_t uid, gid_t gid)
+{
+ int ret;
+ struct stat s;
+ ret = fstat(fd, &s);
+ return !ret && s.st_uid == uid && s.st_gid == gid;
+}
+
+static int make_file_cmp_owner(uid_t uid, gid_t gid)
+{
+ char template[] = P_tmpdir "/.fsid_map_test_XXXXXX";
+ int fd;
+
+ fd = mkstemp(template);
+ if (fd < 0)
+ return -1;
+ unlink(template);
+
+ if (!has_expected_owner(fd, uid, gid)) {
+ close(fd);
+ return -1;
+ }
+
+ return fd;
+}
+
+static void test_id_maps_imply_fsid_maps(void)
+{
+ int fret = EXIT_FAILURE;
+ ssize_t ret;
+ int fd = -EBADF;
+ pid_t pid;
+ int ipc[2];
+ struct clone_args args = {
+ .flags = CLONE_NEWUSER,
+ .exit_signal = SIGCHLD,
+ };
+
+ ret = socketpair(PF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc);
+ if (ret < 0)
+ ksft_exit_fail_msg("socketpair() failed\n");
+
+ pid = sys_clone3(&args, sizeof(args));
+ if (pid < 0) {
+ close(ipc[0]);
+ close(ipc[1]);
+ ksft_exit_fail_msg("clone3() failed\n");
+ }
+
+ if (pid == 0) {
+ int fd;
+ char buf;
+
+ close(ipc[1]);
+
+ ret = read_nointr(ipc[0], &buf, 1);
+ if (ret != 1)
+ ksft_exit_fail_msg("read_nointr() failed\n");
+
+ if (setid_userns_root())
+ ksft_exit_fail_msg("setid_userns_root() failed\n");
+
+ fd = make_file_cmp_owner(0, 0);
+ if (fd < 0)
+ ksft_exit_fail_msg("make_file_cmp_owner() failed\n");
+
+ if (unix_send_fds(ipc[0], &fd, 1, NULL, 0) < 0)
+ ksft_exit_fail_msg("unix_send_fds() failed\n");
+
+ exit(EXIT_SUCCESS);
+ }
+
+ close(ipc[0]);
+
+ ret = write_id_mapping(UID_MAP, pid, id_map, id_map_size);
+ if (ret) {
+ ksft_exit_fail_msg("write_id_mapping(UID_MAP) failed\n");
+ goto kill_child;
+ }
+
+ /* Must fail since a uid mapping has already been written. */
+ ret = write_id_mapping(FSUID_MAP, pid, fsid_map, fsid_map_size);
+ if (ret == 0) {
+ ksft_exit_fail_msg("write_id_mapping(FSUID_MAP) unexpectedly succeeded\n");
+ goto kill_child;
+ }
+
+ ret = write_id_mapping(GID_MAP, pid, id_map, id_map_size);
+ if (ret) {
+ ksft_exit_fail_msg("write_id_mapping(GID_MAP) failed\n");
+ goto kill_child;
+ }
+
+ /* Must fail since a gid mapping has already been written. */
+ ret = write_id_mapping(FSGID_MAP, pid, fsid_map, fsid_map_size);
+ if (ret == 0) {
+ ksft_exit_fail_msg("write_id_mapping(FSGID_MAP) unexpectedly succeeded\n");
+ goto kill_child;
+ }
+
+ ret = write_nointr(ipc[1], "1", 1);
+ if (ret != 1) {
+ ksft_exit_fail_msg("write_nointr() failed\n");
+ goto kill_child;
+ }
+
+ if (unix_recv_fds(ipc[1], &fd, 1, NULL, 0) < 0) {
+ ksft_exit_fail_msg("unix_recv_fds() failed\n");
+ goto kill_child;
+ }
+
+ if (!has_expected_owner(fd, 100000, 100000)) {
+ ksft_exit_fail_msg("has_expected_owner() failed\n");
+ goto kill_child;
+ }
+
+ fret = EXIT_SUCCESS;
+
+wait_child:
+ ret = wait_for_pid(pid);
+ if (ret)
+ ksft_exit_fail_msg("wait_for_pid() failed\n");
+
+ if (fret == EXIT_SUCCESS)
+ return;
+ exit(fret);
+
+kill_child:
+ kill(pid, SIGKILL);
+ exit(EXIT_FAILURE);
+ goto wait_child;
+}
+
+static void test_fsid_maps_basic(void)
+{
+ int fret = EXIT_FAILURE;
+ ssize_t ret;
+ int fd = -EBADF;
+ pid_t pid;
+ int ipc[2];
+ struct clone_args args = {
+ .flags = CLONE_NEWUSER,
+ .exit_signal = SIGCHLD,
+ };
+
+ ret = socketpair(PF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc);
+ if (ret < 0)
+ ksft_exit_fail_msg("socketpair() failed\n");
+
+ pid = sys_clone3(&args, sizeof(args));
+ if (pid < 0) {
+ close(ipc[0]);
+ close(ipc[1]);
+ ksft_exit_fail_msg("clone3() failed\n");
+ }
+
+ if (pid == 0) {
+ int fd;
+ char buf;
+
+ close(ipc[1]);
+
+ ret = read_nointr(ipc[0], &buf, 1);
+ if (ret != 1)
+ ksft_exit_fail_msg("read_nointr() failed\n");
+
+ if (setid_userns_root())
+ ksft_exit_fail_msg("setid_userns_root() failed\n");
+
+ fd = make_file_cmp_owner(0, 0);
+ if (fd < 0)
+ ksft_exit_fail_msg("make_file_cmp_owner() failed\n");
+
+ if (unix_send_fds(ipc[0], &fd, 1, NULL, 0) < 0)
+ ksft_exit_fail_msg("unix_send_fds() failed\n");
+
+ exit(EXIT_SUCCESS);
+ }
+
+ close(ipc[0]);
+
+ /* Must succeed since no uid mapping has been written yet. */
+ ret = write_id_mapping(FSUID_MAP, pid, fsid_map, fsid_map_size);
+ if (ret) {
+ ksft_exit_fail_msg("write_id_mapping(FSUID_MAP) failed\n");
+ goto kill_child;
+ }
+
+ ret = write_id_mapping(UID_MAP, pid, id_map, id_map_size);
+ if (ret) {
+ ksft_exit_fail_msg("write_id_mapping(UID_MAP) failed\n");
+ goto kill_child;
+ }
+
+ /* Must succeed since no gid mapping has been written yet. */
+ ret = write_id_mapping(FSGID_MAP, pid, fsid_map, fsid_map_size);
+ if (ret) {
+ ksft_exit_fail_msg("write_id_mapping(FSGID_MAP) failed\n");
+ goto kill_child;
+ }
+
+ ret = write_id_mapping(GID_MAP, pid, id_map, id_map_size);
+ if (ret) {
+ ksft_exit_fail_msg("write_id_mapping(GID_MAP) failed\n");
+ goto kill_child;
+ }
+
+ ret = write_nointr(ipc[1], "1", 1);
+ if (ret != 1) {
+ ksft_exit_fail_msg("write_nointr() failed\n");
+ goto kill_child;
+ }
+
+ if (unix_recv_fds(ipc[1], &fd, 1, NULL, 0) < 0) {
+ ksft_exit_fail_msg("unix_recv_fds() failed\n");
+ goto kill_child;
+ }
+
+ if (!has_expected_owner(fd, 300000, 300000)) {
+ ksft_exit_fail_msg("has_expected_owner() failed\n");
+ goto kill_child;
+ }
+
+ fret = EXIT_SUCCESS;
+
+wait_child:
+ ret = wait_for_pid(pid);
+ if (ret)
+ ksft_exit_fail_msg("wait_for_pid() failed\n");
+
+ if (fret == EXIT_SUCCESS)
+ return;
+ exit(fret);
+
+kill_child:
+ kill(pid, SIGKILL);
+ exit(EXIT_FAILURE);
+ goto wait_child;
+}
+
+int main(int argc, char *argv[])
+{
+ if (getuid())
+ ksft_exit_skip("fsid mapping tests require root\n");
+
+ if (access("/proc/self/fsuid_map", F_OK))
+ ksft_exit_skip("fsid mappings not supported by this kernel\n");
+
+ test_clone3_supported();
+
+ test_id_maps_imply_fsid_maps();
+ test_fsid_maps_basic();
+
+ exit(EXIT_SUCCESS);
+}
--
2.25.0
When a uid or gid mount option is specified with devpts have it look up the
corresponding kfsids in the fsid mappings. If no fsid mappings are set up the
behavior is unchanged, i.e. fsids are looked up in the id mappings.
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
unchanged
/* v3 */
unchanged
---
fs/devpts/inode.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 42e5a766d33c..139958892572 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -24,6 +24,7 @@
#include <linux/parser.h>
#include <linux/fsnotify.h>
#include <linux/seq_file.h>
+#include <linux/fsuidgid.h>
#define DEVPTS_DEFAULT_MODE 0600
/*
@@ -277,7 +278,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
case Opt_uid:
if (match_int(&args[0], &option))
return -EINVAL;
- uid = make_kuid(current_user_ns(), option);
+ uid = make_kfsuid(current_user_ns(), option);
if (!uid_valid(uid))
return -EINVAL;
opts->uid = uid;
@@ -286,7 +287,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
case Opt_gid:
if (match_int(&args[0], &option))
return -EINVAL;
- gid = make_kgid(current_user_ns(), option);
+ gid = make_kfsgid(current_user_ns(), option);
if (!gid_valid(gid))
return -EINVAL;
opts->gid = gid;
@@ -410,7 +411,7 @@ static int devpts_show_options(struct seq_file *seq, struct dentry *root)
from_kuid_munged(&init_user_ns, opts->uid));
if (opts->setgid)
seq_printf(seq, ",gid=%u",
- from_kgid_munged(&init_user_ns, opts->gid));
+ from_kfsgid_munged(&init_user_ns, opts->gid));
seq_printf(seq, ",mode=%03o", opts->mode);
seq_printf(seq, ",ptmxmode=%03o", opts->ptmxmode);
if (opts->max < NR_UNIX98_PTY_MAX)
--
2.25.0
Switch may_{o_}create() to look up fsids in the fsid mappings. If no fsid
mappings are set up the behavior is unchanged, i.e. fsids are looked up in the
id mappings.
Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.
Cc: Jann Horn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
- Jann Horn <[email protected]>:
- Ensure that the correct fsid is used when dealing with userns visible
filesystems like proc.
/* v3 */
unchanged
---
fs/namei.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index db6565c99825..c5b014000f13 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -39,6 +39,7 @@
#include <linux/bitops.h>
#include <linux/init_task.h>
#include <linux/uaccess.h>
+#include <linux/fsuidgid.h>
#include "internal.h"
#include "mount.h"
@@ -287,6 +288,13 @@ static int check_acl(struct inode *inode, int mask)
return -EAGAIN;
}
+static inline kuid_t get_current_fsuid(const struct inode *inode)
+{
+ if (is_userns_visible(inode->i_sb->s_iflags))
+ return current_kfsuid();
+ return current_fsuid();
+}
+
/*
* This does the basic permission checking
*/
@@ -294,7 +302,7 @@ static int acl_permission_check(struct inode *inode, int mask)
{
unsigned int mode = inode->i_mode;
- if (likely(uid_eq(current_fsuid(), inode->i_uid)))
+ if (likely(uid_eq(get_current_fsuid(inode), inode->i_uid)))
mode >>= 6;
else {
if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
@@ -980,7 +988,7 @@ static inline int may_follow_link(struct nameidata *nd)
/* Allowed if owner and follower match. */
inode = nd->link_inode;
- if (uid_eq(current_cred()->fsuid, inode->i_uid))
+ if (uid_eq(get_current_fsuid(inode), inode->i_uid))
return 0;
/* Allowed if parent directory not sticky and world-writable. */
@@ -1097,7 +1105,7 @@ static int may_create_in_sticky(umode_t dir_mode, kuid_t dir_uid,
(!sysctl_protected_regular && S_ISREG(inode->i_mode)) ||
likely(!(dir_mode & S_ISVTX)) ||
uid_eq(inode->i_uid, dir_uid) ||
- uid_eq(current_fsuid(), inode->i_uid))
+ uid_eq(get_current_fsuid(inode), inode->i_uid))
return 0;
if (likely(dir_mode & 0002) ||
@@ -2832,7 +2840,7 @@ EXPORT_SYMBOL(kern_path_mountpoint);
int __check_sticky(struct inode *dir, struct inode *inode)
{
- kuid_t fsuid = current_fsuid();
+ kuid_t fsuid = get_current_fsuid(inode);
if (uid_eq(inode->i_uid, fsuid))
return 0;
@@ -2902,6 +2910,20 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
return 0;
}
+static bool fsid_has_mapping(struct user_namespace *ns, struct super_block *sb)
+{
+ if (is_userns_visible(sb->s_iflags)) {
+ if (!kuid_has_mapping(ns, current_kfsuid()) ||
+ !kgid_has_mapping(ns, current_kfsgid()))
+ return false;
+ } else if (!kfsuid_has_mapping(ns, current_fsuid()) ||
+ !kfsgid_has_mapping(ns, current_fsgid())) {
+ return false;
+ }
+
+ return true;
+}
+
/* Check whether we can create an object with dentry child in directory
* dir.
* 1. We can't do it if child already exists (open has special treatment for
@@ -2920,8 +2942,7 @@ static inline int may_create(struct inode *dir, struct dentry *child)
if (IS_DEADDIR(dir))
return -ENOENT;
s_user_ns = dir->i_sb->s_user_ns;
- if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
- !kgid_has_mapping(s_user_ns, current_fsgid()))
+ if (!fsid_has_mapping(s_user_ns, dir->i_sb))
return -EOVERFLOW;
return inode_permission(dir, MAY_WRITE | MAY_EXEC);
}
@@ -3103,8 +3124,7 @@ static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t m
return error;
s_user_ns = dir->dentry->d_sb->s_user_ns;
- if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
- !kgid_has_mapping(s_user_ns, current_fsgid()))
+ if (!fsid_has_mapping(s_user_ns, dir->dentry->d_sb))
return -EOVERFLOW;
error = inode_permission(dir->dentry->d_inode, MAY_WRITE | MAY_EXEC);
--
2.25.0
Make sure that during suid/sgid binary execution we look up the fsids in the
fsid mappings. If the kernel is compiled without fsid mappings or no fsid
mappings are set up the behavior is unchanged.
Assume we have a binary in a given user namespace that is owned by 0:0 in the
given user namespace and appears as 300000:300000 on-disk in the initial user
namespace. Now assume we write an id mapping of 0 100000 100000 and an fsid
mapping of 0 300000 300000 in the user namespace. When we hit bprm_fill_uid()
during setid execution we will retrieve inode kuid=300000 and kgid=300000. We
first check whether there's an fsid mapping for these kids. In our scenario we
find that they map to fsuid=0 and fsgid=0 in the user namespace. Now we
translate them into kids in the id mapping. In our example they translate to
kuid=100000 and kgid=100000 which means the file will ultimately run as uid=0
and gid=0 in the user namespace and as uid=100000, gid=100000 in the initial
user namespace.
Let's alter the example and assume that there is an fsid mapping of 0 300000
300000 set up but no id mapping has been set up for the user namespace. In this
case the last step of translating into a valid kid pair in the id mappings will
fail and we will behave as before and ignore the suid/sgid bits.
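The two-step translation in the example can be sketched in user space as
follows. This is an illustration only, not the kernel code: struct extent,
kernel_to_ns(), ns_to_kernel() and fs_owner_to_cred_id() are hypothetical
names, with fs_owner_to_cred_id() standing in for the kfsuid_to_kuid() /
kfsgid_to_kgid() step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX /* stands in for an invalid kuid/kgid */

/* One mapping line: "<first> <lower_first> <count>". */
struct extent {
	uint32_t first;       /* first id inside the user namespace */
	uint32_t lower_first; /* first kernel (initial userns) id */
	uint32_t count;       /* length of the range */
};

/* Kernel id -> id inside the namespace, or INVALID_ID. */
static uint32_t kernel_to_ns(const struct extent *map, size_t n, uint32_t kid)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (kid >= map[i].lower_first &&
		    kid - map[i].lower_first < map[i].count)
			return map[i].first + (kid - map[i].lower_first);
	return INVALID_ID;
}

/* Id inside the namespace -> kernel id, or INVALID_ID. */
static uint32_t ns_to_kernel(const struct extent *map, size_t n, uint32_t id)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (id >= map[i].first && id - map[i].first < map[i].count)
			return map[i].lower_first + (id - map[i].first);
	return INVALID_ID;
}

/*
 * Step 1: translate the inode owner through the fsid mapping into the
 * namespace. Step 2: translate that id through the id mapping back into
 * a kernel id usable as euid/egid. INVALID_ID means "ignore setid bits".
 */
static uint32_t fs_owner_to_cred_id(const struct extent *fsid_map, size_t nfs,
				    const struct extent *id_map, size_t nid,
				    uint32_t inode_kid)
{
	uint32_t ns_id = kernel_to_ns(fsid_map, nfs, inode_kid);

	if (ns_id == INVALID_ID)
		return INVALID_ID;
	return ns_to_kernel(id_map, nid, ns_id);
}
```

With the fsid mapping 0 300000 300000 and the id mapping 0 100000 100000,
inode owner 300000 yields 100000. Without any id mapping the result is
INVALID_ID, which corresponds to the case where the setid bits are ignored.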
Cc: Jann Horn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
/* v2 */
patch added
- Christian Brauner <[email protected]>:
- Make sure that bprm_fill_uid() handles fsid mappings.
/* v3 */
- Christian Brauner <[email protected]>:
- Fix commit message.
---
fs/exec.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..9e4a7e757cef 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -62,6 +62,7 @@
#include <linux/oom.h>
#include <linux/compat.h>
#include <linux/vmalloc.h>
+#include <linux/fsuidgid.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1518,8 +1519,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
{
struct inode *inode;
unsigned int mode;
- kuid_t uid;
- kgid_t gid;
+ kuid_t uid, euid;
+ kgid_t gid, egid;
/*
* Since this can be called multiple times (via prepare_binprm),
@@ -1551,18 +1552,30 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
inode_unlock(inode);
/* We ignore suid/sgid if there are no mappings for them in the ns */
- if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
- !kgid_has_mapping(bprm->cred->user_ns, gid))
+ if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) ||
+ !kfsgid_has_mapping(bprm->cred->user_ns, gid))
return;
+ if (mode & S_ISUID) {
+ euid = kfsuid_to_kuid(bprm->cred->user_ns, uid);
+ if (!uid_valid(euid))
+ return;
+ }
+
+ if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
+ egid = kfsgid_to_kgid(bprm->cred->user_ns, gid);
+ if (!gid_valid(egid))
+ return;
+ }
+
if (mode & S_ISUID) {
bprm->per_clear |= PER_CLEAR_ON_SETID;
- bprm->cred->euid = uid;
+ bprm->cred->euid = euid;
}
if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
bprm->per_clear |= PER_CLEAR_ON_SETID;
- bprm->cred->egid = gid;
+ bprm->cred->egid = egid;
}
}
--
2.25.0
On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> In the usual case of running an unprivileged container we will have
> setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear
> as 0:0 inside the user namespace will be chowned to 100000:100000 on
> the host. This works, because whenever the kernel needs to do a
> filesystem access it will lookup the corresponding uid and gid in the
> idmapping tables of the container. Now think about the case where we
> want to have an id mapping of 0 100000 100000 but an on-disk mapping
> of 0 300000 100000 which is needed to e.g. share a single on-disk
> mapping with multiple containers that all have different id mappings.
> This will be problematic. Whenever a filesystem access is requested,
> the kernel will now try to lookup a mapping for 300000 in the id
> mapping tables of the user namespace but since there is none the
> files will appear to be owned by the overflow id, i.e. usually
> 65534:65534 or nobody:nogroup.
>
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping
> exists, the corresponding files will have correct ownership.
So I did compile this up in order to run the shiftfs tests over it to
see how it coped with the various corner cases. However, what I find
is it simply fails the fsid reverse mapping in the setup. Trying to
use a simple uid of 0 100000 1000 and a fsid of 100000 0 1000 fails the
entry setuid(0) call because of this code:
long __sys_setuid(uid_t uid)
{
	struct user_namespace *ns = current_user_ns();
	const struct cred *old;
	struct cred *new;
	int retval;
	kuid_t kuid;
	kuid_t kfsuid;

	kuid = make_kuid(ns, uid);
	if (!uid_valid(kuid))
		return -EINVAL;

	kfsuid = make_kfsuid(ns, uid);
	if (!uid_valid(kfsuid))
		return -EINVAL;
which means you can't have a fsid mapping that doesn't have the same
domain as the uid mapping, meaning a reverse mapping isn't possible
because the range and domain have to be inverse and disjoint.
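To make the failure concrete, here is a minimal userspace sketch that mirrors only the two validity checks quoted above. The single-extent map type and the helper names are assumptions for illustration, not the kernel's data structures:

```c
#include <errno.h>
#include <stdint.h>

struct id_extent {
	uint32_t first;       /* first id inside the namespace (the domain) */
	uint32_t lower_first; /* first id outside the namespace (the range) */
	uint32_t count;
};

#define INVALID_ID UINT32_MAX /* stands in for INVALID_UID */

/* Inside id -> outside id, or INVALID_ID if the id is not in the map's domain. */
static uint32_t map_id_down(const struct id_extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* Both lookups must succeed for the same namespace uid, as in the quoted code. */
static int setuid_checks(const struct id_extent *uid_map,
			 const struct id_extent *fsuid_map, uint32_t uid)
{
	if (map_id_down(uid_map, uid) == INVALID_ID)
		return -EINVAL;
	if (map_id_down(fsuid_map, uid) == INVALID_ID)
		return -EINVAL;
	return 0;
}
```

With a uid map of 0 100000 1000 and an fsid map of 100000 0 1000, uid 0 is in the domain of the first map but not of the second, so the second check returns -EINVAL.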
James
On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> > [...]
>
> So I did compile this up in order to run the shiftfs tests over it to
> see how it coped with the various corner cases. However, what I find
> is it simply fails the fsid reverse mapping in the setup. Trying to
> use a simple uid of 0 100000 1000 and a fsid of 100000 0 1000 fails the
> entry setuid(0) call because of this code:
This is easy to fix. But what's the exact use-case?
On Tue, Feb 18, 2020 at 3:35 PM Christian Brauner
<[email protected]> wrote:
[...]
> - Let the keyctl infrastructure only operate on kfsid which are always
> mapped/looked up in the id mappings similar to what we do for
> filesystems that have the same superblock visible in multiple user
> namespaces.
>
> This version also comes with minimal tests which I intend to expand in
> the future.
>
> From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> adresses at hand but I've at least added Phil now. :)
>
> This is the implementation of shiftfs which was cooked up during lunch at
> Linux Plumbers 2019 the day after the container's microconference. The
> idea is a design-stew from Stéphane, Aleksa, Eric, and myself (and by
> now also Jann.
> Back then we all were quite busy with other work and couldn't really sit
> down and implement it. But I took a few days last week to do this work,
> including demos and performance testing.
> This implementation does not require us to touch the VFS substantially
> at all. Instead, we implement shiftfs via fsid mappings.
> With this patch, it took me 20 mins to port both LXD and LXC to support
> shiftfs via fsid mappings.
[...]
Can you please grep through the kernel for all uses of ->fsuid and
->fsgid and fix them up appropriately? Some cases I still see:
The SafeSetID LSM wants to enforce that you can only use CAP_SETUID to
gain the privileges of a specific set of IDs:
static int safesetid_task_fix_setuid(struct cred *new,
const struct cred *old,
int flags)
{
/* Do nothing if there are no setuid restrictions for our old RUID. */
if (setuid_policy_lookup(old->uid, INVALID_UID) == SIDPOL_DEFAULT)
return 0;
if (uid_permitted_for_cred(old, new->uid) &&
uid_permitted_for_cred(old, new->euid) &&
uid_permitted_for_cred(old, new->suid) &&
uid_permitted_for_cred(old, new->fsuid))
return 0;
/*
* Kill this process to avoid potential security vulnerabilities
* that could arise from a missing whitelist entry preventing a
* privileged process from dropping to a lesser-privileged one.
*/
force_sig(SIGKILL);
return -EACCES;
}
This could theoretically be bypassed through setfsuid() if the kuid
based on the fsuid mappings is permitted but the kuid based on the
normal mappings is not.
fs/coredump.c in suid dump mode uses "cred->fsuid = GLOBAL_ROOT_UID";
this should probably also fix up the other uid, even if there is no
scenario in which it would actually be used at the moment?
The netfilter xt_owner stuff makes packet filtering decisions based on
the ->fsuid; it might be better to filter on the ->kfsuid so that you
can filter traffic from different user namespaces differently?
audit_log_task_info() is doing "from_kuid(&init_user_ns, cred->fsuid)".
On Wed, 2020-02-19 at 13:27 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
[...]
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now lookup the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> >
> > So I did compile this up in order to run the shiftfs tests over it
> > to see how it coped with the various corner cases. However, what I
> > find is it simply fails the fsid reverse mapping in the
> > setup. Trying to use a simple uid of 0 100000 1000 and a fsid of
> > 100000 0 1000 fails the entry setuid(0) call because of this code:
>
> This is easy to fix. But what's the exact use-case?
Well, the use case I'm looking to solve is the same one it's always
been: getting a deprivileged fake root in a user_ns to be able to write
an image at fsuid 0.
I don't think it's solvable in your current framework, although
allowing the domain to be disjoint might possibly hack around it. The
problem with the proposed framework is that there are no backshifts
from the filesystem view, there are only forward shifts to the
filesystem view. This means that to get your framework to write a
filesystem at fsuid 0 you have to have an identity map for fsuid. Which
I can do: I tested uid shift 0 100000 1000 and fsuid shift 0 0 1000.
It does all work, as you'd expect, because the container has real fs
root, not a fake root. And that's the whole problem. Firstly, I'm fs
root for any filesystem my userns can see, so given any imprecision in
setting up the mount namespace of the container, I own your host.
Secondly, given any containment break, I'm privileged with respect to
the fs uid wherever I escape to, so I will likewise own your host.
The only way to keep containment is to have a zero fsuid inside the
container corresponding to a non-zero one outside. And the only way to
solve the imprecision in mount namespace issue is to strictly control
the entry point at which the writing at fsuid 0 becomes active.
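As a small userspace sketch of that containment condition (single-extent maps and all names here are illustrative assumptions, not kernel code): namespace fsuid 0 is only "fake" root if it translates to a non-zero id outside.

```c
#include <stdbool.h>
#include <stdint.h>

struct id_extent {
	uint32_t first;       /* first id inside the namespace */
	uint32_t lower_first; /* first id outside the namespace */
	uint32_t count;
};

#define INVALID_ID UINT32_MAX /* stands in for an unmapped id */

/* Inside id -> outside id, or INVALID_ID if unmapped. */
static uint32_t map_id_down(const struct id_extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* True if fsuid 0 inside the namespace acts as fsuid 0 on the host. */
static bool ns_root_is_host_fs_root(const struct id_extent *fsuid_map)
{
	return map_id_down(fsuid_map, 0) == 0;
}
```

The identity fsuid shift 0 0 1000 makes this return true, which is the problematic case; a shift such as 0 300000 1000 keeps fsuid 0 inside mapped to a non-zero id outside.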
James
On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.
So if I have
/proc/self/uid_map: 0 100000 100000
/proc/self/fsid_map: 1000 1000 1
1. If I read files from the rootfs which have host uid 101000, they
will appear as uid 100 to me?
2. If I read host files with uid 1000, they will appear as uid 1000 to me?
3. If I create a new file, as uid 1000, what will be the inode owning uid?
On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
>
> So if I have
>
> /proc/self/uid_map: 0 100000 100000
> /proc/self/fsid_map: 1000 1000 1
Oh, sorry. Your explanation in 20/25 I think set me straight, though I need
to think through a few more examples.
...
> 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?
(Note - I edited the quoted text above to be more precise)
I'm still not quite clear on this. I believe the fsid mapping will take
precedence so it'll be uid 1000? Per-mount behavior would be nice there,
but perhaps unwieldy.
On Wed, Feb 19, 2020 at 03:48:37PM -0600, Serge E. Hallyn wrote:
> On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> > On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > > With fsid mappings we can solve this by writing an id mapping of 0
> > > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > > access the kernel will now lookup the mapping for 300000 in the fsid
> > > mapping tables of the user namespace. And since such a mapping exists,
> > > the corresponding files will have correct ownership.
> >
> > So if I have
> >
> > /proc/self/uid_map: 0 100000 100000
> > /proc/self/fsid_map: 1000 1000 1
>
> Oh, sorry. Your explanation in 20/25 i think set me straight, though I need
> to think through a few more examples.
>
> ...
>
> > 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?
>
> (Note - I edited the quoted txt above to be more precise)
>
> I'm still not quite clear on this. I believe the fsid mapping will take
> precedence so it'll be uid 1000 ? Per mount behavior would be nice there,
> but perhaps unwieldy.
The is_userns_visible() bits seem to be an attempt at understanding
what people would want per-mount, with a policy hard coded in the
kernel.
But maybe per-mount behavior can be solved more elegantly with shifted
bind mounts, so we can drop all that from this series, and ignore
per-mount settings here?
Tycho
On 2/18/20 9:33 AM, Christian Brauner wrote:
> Hey everyone,
>
> This is v3 after (off- and online) discussions with Jann the following
> changes were made:
> - To handle nested user namespaces cleanly, efficiently, and with full
> backwards compatibility for non fsid-mapping aware workloads we only
> allow writing fsid mappings as long as the corresponding id mapping
> type has not been written.
> - Split the patch which adds the internal ability in
> kernel/user_namespace to verify and write fsid mappings into tree
> patches:
> 1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
> patch to implement core helpers for fsid translations (i.e.
> make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
> k*id_to_kfs*id()
> 2. [PATCH v3 05/25] user_namespace: refactor map_write()
> patch to refactor map_write() in order to prepare for actual fsid
> mappings changes in the following patch. (This should make it
> easier to review.)
> 3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
> patch to implement actual fsid mappings support in mape_write()
> - Let the keyctl infrastructure only operate on kfsid which are always
> mapped/looked up in the id mappings similar to what we do for
> filesystems that have the same superblock visible in multiple user
> namespaces.
>
> This version also comes with minimal tests which I intend to expand in
> the future.
>
> From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> adresses at hand but I've at least added Phil now. :)
>
I put this into a kernel for our container guys to mess with in order to
validate it would actually be useful for real world uses. I've cc'ed the guy
who did all of the work in case you have specific questions.
Good news is the interface is acceptable, albeit apparently the whole user ns
interface sucks in general. But you haven't made it worse, so success!
But in testing it there appears to be a problem with tmpfs. Our applications
use shared memory segments for certain things and this apparently breaks in
interesting ways: it appears to not shift the UID appropriately on tmpfs.
This seems to be relatively straightforward to reproduce, but if you have
trouble let me know and I'll come up with a shell script that reproduces the
problem.
We are happy to continue testing these patches to make sure they're working in
our container setup. If you CC me on future submissions I can build them
for our internal testing and validate them as well. Thanks,
Josef
On Thu, Feb 27, 2020 at 02:33:04PM -0500, Josef Bacik wrote:
> On 2/18/20 9:33 AM, Christian Brauner wrote:
> > [...]
> >
> I put this into a kernel for our container guys to mess with in order to
> validate it would actually be useful for real world uses. I've cc'ed the
> guy who did all of the work in case you have specific questions.
>
> Good news is the interface is acceptable, albeit apparently the whole user
> ns interface sucks in general. But you haven't made it worse, so success!
Well I very much disagree here :) With the first part! But I do
understand the shortcomings. Anyway,
I still hope we get to talk about this in person, but IMO this is the
right approach (this being - thinking about how to make the uid mappings
more flexible without making them too complicated to be safe to use),
but a bit too static in terms of target. There are at least two ways
that I could see usefully generalizing it.
From a user space pov, the following goal is indispensable (for my use
cases): that the fsuid be selectable based on fs, mountpoint, or file
context (as in selinux).
From a userns pov, one way to look at it is this: when task t1 signals
task t2, it's not only t1's namespace that's considered when filling in
the sender uid, but also t2's. Likewise, when writing a file, we should
consider both t1's fsuid+userns, and the file's, mount's, or filesystem's
userns.
From that POV, your patch is a step in the right direction and could be
taken as is (modulo any tmpfs fix Josef needs :) From there I would
propose adding a 'uidns=<uidnsfd>' bind mount option, so we could create
an empty userns with the desired mapping (subject to permissions granted
by subuids), get an fd to the uidns, and say
mount --bind -o uidns=5 /shared /containers/c1/mnt/shared
So now when I write a file /etc/hosts as container fsuid 0, it'll be
subject to the container rootfs mount's uid mapping, presumably
100000. When I write /mnt/shared/hello, it'll be subject to the mount's
uid mapping, which might be 1000.
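Roughly, as a userspace sketch of that idea (the mount table, the range sizes, and all names are made up for illustration; no such mount option exists today): the on-disk owner is picked by the mapping attached to the mount the path lives under.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct id_extent { uint32_t first, lower_first, count; };

/* Hypothetical per-mount mapping table. */
struct mount_map {
	const char *mountpoint;
	struct id_extent map;
};

#define INVALID_ID UINT32_MAX

static const struct mount_map mounts[] = {
	{ "/",           { 0, 100000, 65536 } }, /* container rootfs (assumed size) */
	{ "/mnt/shared", { 0, 1000, 1 } },       /* shared bind mount (assumed size) */
};

/* True if path is the mountpoint itself or lies below it. */
static int under_mount(const char *path, const char *mnt)
{
	size_t n = strlen(mnt);

	if (strcmp(mnt, "/") == 0)
		return path[0] == '/';
	return strncmp(path, mnt, n) == 0 && (path[n] == '/' || path[n] == '\0');
}

/* On-disk owner for a file written at 'path' by container fsuid 'fsuid'. */
static uint32_t disk_owner(const char *path, uint32_t fsuid)
{
	const struct mount_map *best = NULL;
	size_t best_len = 0;

	/* Pick the longest matching mountpoint. */
	for (size_t i = 0; i < sizeof(mounts) / sizeof(mounts[0]); i++) {
		size_t len = strlen(mounts[i].mountpoint);

		if (under_mount(path, mounts[i].mountpoint) && len >= best_len) {
			best = &mounts[i];
			best_len = len;
		}
	}
	if (!best || fsuid < best->map.first ||
	    fsuid - best->map.first >= best->map.count)
		return INVALID_ID;
	return best->map.lower_first + (fsuid - best->map.first);
}
```

Here disk_owner("/etc/hosts", 0) goes through the rootfs mapping and disk_owner("/mnt/shared/hello", 0) through the shared mount's mapping.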
-serge