2020-06-03 16:26:54

by Adrian Reber

[permalink] [raw]
Subject: [PATCH v2 0/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

This is v2 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
difference from v1 are:

* Renamed CAP_RESTORE to CAP_CHECKPOINT_RESTORE
* Added a test
* Added details about CRIU's use of map_files
* Allow changing /proc/self/exe link with CAP_CHECKPOINT_RESTORE

The biggest difference is that the patchset now provides all the
changes, which are necessary to use CRIU to checkpoint and restore a
process as non-root if CAP_CHECKPOINT_RESTORE is set.

Adrian Reber (2):
capabilities: Introduce CAP_CHECKPOINT_RESTORE
selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (1):
prctl: Allow ptrace capable processes to change exe_fd

fs/proc/base.c | 8 +-
include/linux/capability.h | 6 +
include/uapi/linux/capability.h | 9 +-
kernel/pid.c | 2 +-
kernel/pid_namespace.c | 2 +-
kernel/sys.c | 21 +-
security/selinux/include/classmap.h | 5 +-
tools/testing/selftests/clone3/Makefile | 4 +-
.../clone3/clone3_cap_checkpoint_restore.c | 203 ++++++++++++++++++
9 files changed, 245 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c


base-commit: 48f99181fc118d82dc8bf6c7221ad1c654cb8bc2
--
2.26.2


2020-06-03 16:27:06

by Adrian Reber

[permalink] [raw]
Subject: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID of the
restored process. This feature available via the clone3 system call, or via
/proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
manager distributing jobs where users are always running as non-root.
There is a desire to provide a way to checkpoint and restore long running
jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
CRIU into a Java VM to decrease the startup time. These checkpoint/restore
applications are not meant to be running with CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
Unfortunately, this helper delegation technique is impossible to use with
clone3, and is thus prone to races.
See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
This has been demonstrated to work with cycling rates of 100,000 PIDs/s
See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
typical container environments (e.g., Kubernetes) as /proc is
typically protected with read-only layers (e.g., /proc/sys) for hardening
purposes. Read-only layers prevent additional /proc mounts (due to proc's
SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
certain applications need access to /proc matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
files that are unreachable via the file system such as deleted files, or memfd
files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Nicolas Viennot <[email protected]>
---
v2:
- Renamed CAP_RESTORE to CAP_CHECKPOINT_RESTORE
- Added a test
- Added details about CRIU's use of map_files
- Allow changing /proc/self/exe link with CAP_CHECKPOINT_RESTORE
---
fs/proc/base.c | 8 ++++----
include/linux/capability.h | 6 ++++++
include/uapi/linux/capability.h | 9 ++++++++-
kernel/pid.c | 2 +-
kernel/pid_namespace.c | 2 +-
security/selinux/include/classmap.h | 5 +++--
6 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0afc8a85..ce02f3a4b2d7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2189,16 +2189,16 @@ struct map_files_info {
};

/*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
*/
static const char *
proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
- if (!capable(CAP_SYS_ADMIN))
+ if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
return ERR_PTR(-EPERM);

return proc_pid_get_link(dentry, inode, done);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
}

+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+ return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+ ns_capable(ns, CAP_SYS_ADMIN);
+}
+
/* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
*/
#define CAP_BPF 39

-#define CAP_LAST_CAP CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE 40
+
+#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE

#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)

diff --git a/kernel/pid.c b/kernel/pid.c
index 3122043fe364..ada55c7b2b19 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
if (tid != 1 && !tmp->child_reaper)
goto out_free;
retval = -EPERM;
- if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+ if (!checkpoint_restore_ns_capable(tmp->user_ns))
goto out_free;
set_tid_size--;
}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
struct ctl_table tmp = *table;
int ret, next;

- if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+ if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
return -EPERM;

/*
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 98e1513b608a..40cebde62856 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
"audit_control", "setfcap"

#define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
- "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+ "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+ "checkpoint_restore"

-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
#error New capability defined, please update COMMON_CAP2_PERMS.
#endif

--
2.26.2

2020-06-03 16:28:07

by Adrian Reber

[permalink] [raw]
Subject: [PATCH v2 3/3] prctl: Allow ptrace capable processes to change exe_fd

From: Nicolas Viennot <[email protected]>

The current process is authorized to change its /proc/self/exe link via
two policies:
1) The current user can do checkpoint/restore In other words is
CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE capable.
2) The current user can use ptrace.

With access to ptrace facilities, a process can do the following: fork a
child, execve() the target executable, and have the child use ptrace()
to replace the memory content of the current process. This technique
makes it possible to masquerade an arbitrary program as any executable,
even setuid ones.

This commit also changes the permission error code from -EINVAL to
-EPERM for consistency with the rest of the prctl() syscall when
checking capabilities.

Signed-off-by: Nicolas Viennot <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
---
kernel/sys.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index fd46865b46ba..2f34108e26e0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1994,12 +1994,23 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data

if (prctl_map.exe_fd != (u32)-1) {
/*
- * Make sure the caller has the rights to
- * change /proc/pid/exe link: only local sys admin should
- * be allowed to.
+ * The current process is authorized to change its
+ * /proc/self/exe link via two policies:
+ * 1) The current user can do checkpoint/restore
+ * In other words is CAP_SYS_ADMIN or
+ * CAP_CHECKPOINT_RESTORE capable.
+ * 2) The current user can use ptrace.
+ *
+ * With access to ptrace facilities, a process can do the
+ * following: fork a child, execve() the target executable,
+ * and have the child use ptrace() to replace the memory
+ * content of the current process. This technique makes it
+ * possible to masquerade an arbitrary program as the target
+ * executable, even if it is setuid.
*/
- if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
- return -EINVAL;
+ if (!(checkpoint_restore_ns_capable(current_user_ns()) ||
+ security_ptrace_access_check(current, PTRACE_MODE_ATTACH_REALCREDS)))
+ return -EPERM;

error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
--
2.26.2

2020-06-03 16:28:30

by Adrian Reber

[permalink] [raw]
Subject: [PATCH v2 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber <[email protected]>
---
tools/testing/selftests/clone3/Makefile | 4 +-
.../clone3/clone3_cap_checkpoint_restore.c | 203 ++++++++++++++++++
2 files changed, 206 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap

-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+ clone3_cap_checkpoint_restore

include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index 000000000000..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+ fflush(stdout);
+ fflush(stderr);
+ _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
+{
+ int status;
+ pid_t pid = -1;
+
+ struct clone_args args = {
+ .exit_signal = SIGCHLD,
+ .set_tid = ptr_to_u64(set_tid),
+ .set_tid_size = set_tid_size,
+ };
+
+ pid = sys_clone3(&args, sizeof(struct clone_args));
+ if (pid < 0) {
+ ksft_print_msg("%s - Failed to create new process\n",
+ strerror(errno));
+ return -errno;
+ }
+
+ if (pid == 0) {
+ int ret;
+ char tmp = 0;
+
+ ksft_print_msg
+ ("I am the child, my PID is %d (expected %d)\n",
+ getpid(), set_tid[0]);
+
+ if (set_tid[0] != getpid())
+ child_exit(EXIT_FAILURE);
+ child_exit(EXIT_SUCCESS);
+ }
+
+ ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+ getpid(), pid);
+
+ if (waitpid(pid, &status, 0) < 0) {
+ ksft_print_msg("Child returned %s\n", strerror(errno));
+ return -errno;
+ }
+
+ if (!WIFEXITED(status))
+ return -1;
+
+ return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t * set_tid,
+ size_t set_tid_size, int expected)
+{
+ int ret;
+
+ ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
+ getpid(), set_tid[0]);
+ ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+ ksft_print_msg
+ ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+ getpid(), set_tid[0], ret, expected);
+ if (ret != expected) {
+ ksft_test_result_fail
+ ("[%d] Result (%d) is different than expected (%d)\n",
+ getpid(), ret, expected);
+ return -1;
+ }
+ ksft_test_result_pass
+ ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+ expected);
+
+ return 0;
+}
+
+struct libcap {
+ struct __user_cap_header_struct hdr;
+ struct __user_cap_data_struct data[2];
+};
+
+static int set_capability()
+{
+ cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+ struct libcap *cap;
+ int ret = -1;
+ cap_t caps;
+
+ caps = cap_get_proc();
+ if (!caps) {
+ perror("cap_get_proc");
+ return -1;
+ }
+
+ /* Drop all capabilities */
+ if (cap_clear(caps)) {
+ perror("cap_clear");
+ goto out;
+ }
+
+ cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+ cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+ cap = (struct libcap *) caps;
+
+ /* 40 -> CAP_CHECKPOINT_RESTORE */
+ cap->data[1].effective |= 1 << (40 - 32);
+ cap->data[1].permitted |= 1 << (40 - 32);
+
+ if (cap_set_proc(caps)) {
+ perror("cap_set_proc");
+ goto out;
+ }
+ ret = 0;
+out:
+ if (cap_free(caps))
+ perror("cap_free");
+ return ret;
+}
+
+int main(int argc, char *argv[])
+{
+ pid_t pid;
+ int status;
+ int ret = 0;
+ pid_t set_tid[1];
+ uid_t uid = getuid();
+
+ ksft_print_header();
+ test_clone3_supported();
+ ksft_set_plan(2);
+
+ if (uid != 0) {
+ ksft_cnt.ksft_xskip = ksft_plan;
+ ksft_print_msg("Skipping all tests as non-root\n");
+ return ksft_exit_pass();
+ }
+
+ memset(&set_tid, 0, sizeof(set_tid));
+
+ /* Find the current active PID */
+ pid = fork();
+ if (pid == 0) {
+ ksft_print_msg("Child has PID %d\n", getpid());
+ child_exit(EXIT_SUCCESS);
+ }
+ if (waitpid(pid, &status, 0) < 0)
+ ksft_exit_fail_msg("Waiting for child %d failed", pid);
+
+ /* After the child has finished, its PID should be free. */
+ set_tid[0] = pid;
+
+ if (set_capability())
+ ksft_test_result_fail
+ ("Could not set CAP_CHECKPOINT_RESTORE\n");
+ prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
+ /* This would fail without CAP_CHECKPOINT_RESTORE */
+ setgid(1000);
+ setuid(1000);
+ set_tid[0] = pid;
+ ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
+ if (set_capability())
+ ksft_test_result_fail
+ ("Could not set CAP_CHECKPOINT_RESTORE\n");
+ /* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
+ ret |= test_clone3_set_tid(set_tid, 1, 0);
+
+ return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
--
2.26.2

2020-06-03 17:06:01

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
...
>
> /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
> */
> static const char *
> proc_map_files_get_link(struct dentry *dentry,
> struct inode *inode,
> struct delayed_call *done)
> {
> - if (!capable(CAP_SYS_ADMIN))
> + if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
> return ERR_PTR(-EPERM);

You know, I'm still not sure if we need this capable() check at all since
we have proc_fd_access_allowed() called but anyway can we please make this
if() condition more explicit

if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
return ERR_PTR(-EPERM);

though I won't insist. And I'll reread the series a bit later once I've
some spare time to.

2020-06-09 03:44:31

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
>
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
>
> The main blocker to restore a process as non-root was to control the PID of the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
>
> In the past two years, requests for non-root checkpoint/restore have increased
> due to the following use cases:
> * Checkpoint/Restore in an HPC environment in combination with a resource
> manager distributing jobs where users are always running as non-root.
> There is a desire to provide a way to checkpoint and restore long running
> jobs.
> * Container migration as non-root
> * We have been in contact with JVM developers who are integrating
> CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> applications are not meant to be running with CAP_SYS_ADMIN.
>
...
>
> The introduced capability allows to:
> * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> for the corresponding PID namespace via ns_last_pid/clone3.
> * Open files in /proc/pid/map_files when the current user is
> CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> files that are unreachable via the file system such as deleted files, or memfd
> files.

PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
CAP_SYS_ADMIN too.

Thanks,
Andrei

2020-06-09 07:49:47

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > checkpoint/restore for non-root users.
> >
> > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> > asked numerous times if it is possible to checkpoint/restore a process as
> > non-root. The answer usually was: 'almost'.
> >
> > The main blocker to restore a process as non-root was to control the PID of the
> > restored process. This feature available via the clone3 system call, or via
> > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> >
> > In the past two years, requests for non-root checkpoint/restore have increased
> > due to the following use cases:
> > * Checkpoint/Restore in an HPC environment in combination with a resource
> > manager distributing jobs where users are always running as non-root.
> > There is a desire to provide a way to checkpoint and restore long running
> > jobs.
> > * Container migration as non-root
> > * We have been in contact with JVM developers who are integrating
> > CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > applications are not meant to be running with CAP_SYS_ADMIN.
> >
> ...
> >
> > The introduced capability allows to:
> > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> > for the corresponding PID namespace via ns_last_pid/clone3.
> > * Open files in /proc/pid/map_files when the current user is
> > CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> > files that are unreachable via the file system such as deleted files, or memfd
> > files.
>
> PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> CAP_SYS_ADMIN too.

This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
safe to allow unprivileged users to suspend security policies? That
sounds like a bad idea.

if (unlikely(data & PTRACE_O_SUSPEND_SECCOMP)) {
if (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
!IS_ENABLED(CONFIG_SECCOMP))
return -EINVAL;

if (!capable(CAP_SYS_ADMIN))
return -EPERM;

if (seccomp_mode(&current->seccomp) != SECCOMP_MODE_DISABLED ||
current->ptrace & PT_SUSPEND_SECCOMP)
return -EPERM;
}

Christian

2020-06-09 16:09:08

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> > On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> > > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > > checkpoint/restore for non-root users.
> > >
> > > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> > > asked numerous times if it is possible to checkpoint/restore a process as
> > > non-root. The answer usually was: 'almost'.
> > >
> > > The main blocker to restore a process as non-root was to control the PID of the
> > > restored process. This feature available via the clone3 system call, or via
> > > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > >
> > > In the past two years, requests for non-root checkpoint/restore have increased
> > > due to the following use cases:
> > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > > manager distributing jobs where users are always running as non-root.
> > > There is a desire to provide a way to checkpoint and restore long running
> > > jobs.
> > > * Container migration as non-root
> > > * We have been in contact with JVM developers who are integrating
> > > CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > > applications are not meant to be running with CAP_SYS_ADMIN.
> > >
> > ...
> > >
> > > The introduced capability allows to:
> > > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> > > for the corresponding PID namespace via ns_last_pid/clone3.
> > > * Open files in /proc/pid/map_files when the current user is
> > > CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> > > files that are unreachable via the file system such as deleted files, or memfd
> > > files.
> >
> > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > CAP_SYS_ADMIN too.
>
> This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> safe to allow unprivileged users to suspend security policies? That
> sounds like a bad idea.

Why do you think so bad about me;). I don't suggest to remove or
downgrade this capability check. The patch allows all c/r related
operations if the current has CAP_CHECKPOINT_RESTORE.

So in this case the check:
if (!capable(CAP_SYS_ADMIN))
return -EPERM;

will be converted in:
if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
return -EPERM;

If we want to think about how to convert this capable to ns_capable, we
need to do this in a separate series. And the logic may be that a
process is able to suspend only filters that have been added from the
current user-namespace or its descendants. But we need to think about
this more carefully, maybe there are more pitfalls.

2020-06-09 16:19:59

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
> On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> > On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> > > On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> > > > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > > > checkpoint/restore for non-root users.
> > > >
> > > > Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> > > > asked numerous times if it is possible to checkpoint/restore a process as
> > > > non-root. The answer usually was: 'almost'.
> > > >
> > > > The main blocker to restore a process as non-root was to control the PID of the
> > > > restored process. This feature available via the clone3 system call, or via
> > > > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > > >
> > > > In the past two years, requests for non-root checkpoint/restore have increased
> > > > due to the following use cases:
> > > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > > > manager distributing jobs where users are always running as non-root.
> > > > There is a desire to provide a way to checkpoint and restore long running
> > > > jobs.
> > > > * Container migration as non-root
> > > > * We have been in contact with JVM developers who are integrating
> > > > CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > > > applications are not meant to be running with CAP_SYS_ADMIN.
> > > >
> > > ...
> > > >
> > > > The introduced capability allows to:
> > > > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> > > > for the corresponding PID namespace via ns_last_pid/clone3.
> > > > * Open files in /proc/pid/map_files when the current user is
> > > > CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> > > > files that are unreachable via the file system such as deleted files, or memfd
> > > > files.
> > >
> > > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > > CAP_SYS_ADMIN too.
> >
> > This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> > safe to allow unprivileged users to suspend security policies? That
> > sounds like a bad idea.
>
> Why do you think so bad about me;). I don't suggest to remove or

Andrei, nothing could be further from me than to think bad about you!
You've done way too much excellent work. ;)

> downgrade this capability check. The patch allows all c/r related
> operations if the current has CAP_CHECKPOINT_RESTORE.
>
> So in this case the check:
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> will be converted in:
> if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
> return -EPERM;

Yeah, I got that but what's the goal here? Isn't it that you want to
make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
fscap set so that unprivileged users can restore their own processes
without creating a new user namespace or am I missing something? The
use-cases in the cover-letter make it sound like that's what this is
leading up to:

> > > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > > > manager distributing jobs where users are always running as non-root.
> > > > There is a desire to provide a way to checkpoint and restore long running
> > > > jobs.
> > > > * Container migration as non-root
> > > > * We have been in contact with JVM developers who are integrating
> > > > CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > > > applications are not meant to be running with CAP_SYS_ADMIN.

But maybe I'm just misunderstanding crucial bits (likely (TM)).

Christian

2020-06-09 20:34:28

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Wed, Jun 03, 2020 at 06:23:26PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
>
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
>
> The main blocker to restore a process as non-root was to control the PID of the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
...
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d86c0afc8a85..ce02f3a4b2d7 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2189,16 +2189,16 @@ struct map_files_info {
> };
>
> /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
> */
> static const char *
> proc_map_files_get_link(struct dentry *dentry,
> struct inode *inode,
> struct delayed_call *done)
> {
> - if (!capable(CAP_SYS_ADMIN))
> + if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
> return ERR_PTR(-EPERM);

First of all -- sorry for late reply. You know, looking into this code more
I think this CAP_SYS_ADMIN is simply wrong: for example I can't even fetch
links for /proc/self/map_files. Still /proc/$pid/maps (which as well points
to the files opened) test for ptrace-read permission. I think we need
ptrace-may-attach test here instead of these capabilities (if I can attach
to a process I can read any data needed, including the content of the
mapped files, if only I'm not missing something obvious).

2020-06-09 20:41:20

by Nicolas Viennot

[permalink] [raw]
Subject: RE: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

>> proc_map_files_get_link(struct dentry *dentry,
>> struct inode *inode,
>> struct delayed_call *done)
>> {
>> - if (!capable(CAP_SYS_ADMIN))
>> + if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>> return ERR_PTR(-EPERM);

> First of all -- sorry for late reply. You know, looking into this code more I think this CAP_SYS_ADMIN is simply wrong: for example I can't even fetch links for /proc/self/map_files. Still /proc/$pid/maps (which as well points to the files opened) test for ptrace-read permission. I think we need ptrace-may-attach test here instead of these capabilities (if I can attach to a process I can read any data needed, including the content of the mapped files, if only I'm not missing something obvious).

Currently /proc/pid/map_files/* have exactly the same permission checks as /proc/pid/fd/*, with the exception of the extra CAP_SYS_ADMIN check. The check originated from the following discussions where 3 security issues are discussed:
http://lkml.iu.edu/hypermail/linux/kernel/1505.2/02524.html
http://lkml.iu.edu/hypermail/linux/kernel/1505.2/04030.html

From what I understand, the extra CAP_SYS_ADMIN comes from the following issues:
1. Being able to open dma-buf / kdbus region (referred in the referenced email as problem #1). I don't fully understand what the dangers are, but perhaps we could do CAP_SYS_ADMIN check only for such dangerous files, as opposed to all files.
2. /proc/pid/fd/* is already a security hole (Andy says "I hope to fix that some day"). He essentially says that it's not because fds are insecure that map_files should be too. He seems to claim that mapped files that are then closed seems to be a bigger concern than other opened files. However, in the present time (5 years after these email conversations), the fd directory does not have the CAP_SYS_ADMIN check which doesn't convinces me that the holes of /proc/pid/fd/* are such a big of a deal. I'm not entirely sure what security issue Andy refers to, but, I understand something along the lines of: Some process gets an fd of a file read-only opened (via a unix socket for example, or after a chroot), and gets to re-open the file in write access via /proc/self/fd/N to do some damage.
3. Being able to ftruncate a file after a chroot+privilege drop. I may be wrong, but if privileges were dropped, then there's no reason that the then unprivileged user would have write access to the mmaped file inode. Seems a false problem.

It turns out that some of these concerns have been addressed with the introduction of memfd with seals, introduced around the same time where the map_files discussions took place. These seals allow one to share write access of an mmap region to an unsecure program, without fearing of getting a SIGBUS because the unsecure program could call ftruncate() on the fd. More on that at https://lwn.net/Articles/593918/ . Also, that article says "There are a number of fairly immediate use cases for the sealing functionality in general. Graphics drivers could use it to safely receive buffers from applications. The upcoming kdbus transport can benefit from sealing.". This rings a bell with problem #1. Perhaps memfd is a solution to Andy's concerns?

Overall, I think the CAP_SYS_ADMIN map_files/ extra check compared to fd/ does not improve security in practice. Fds will be given to insecure programs. Better security can be achieved with memfd seals, and sane permissioning on files, regardless if they were once closed.

I think Adrian added a CAP_CHECKPOINT_RESTORE on the map_files to avoid opening a can of worm. But I guess the cat is out of the bag now.

-Nico

2020-06-09 21:12:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

Nicolas Viennot <[email protected]> writes:

>>> proc_map_files_get_link(struct dentry *dentry,
>>> struct inode *inode,
>>> struct delayed_call *done)
>>> {
>>> - if (!capable(CAP_SYS_ADMIN))
>>> + if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
>>> return ERR_PTR(-EPERM);
>
>> First of all -- sorry for late reply. You know, looking into this
>> code more I think this CAP_SYS_ADMIN is simply wrong: for example I
>> can't even fetch links for /proc/self/map_files. Still
>> /proc/$pid/maps (which as well points to the files opened) test for
>> ptrace-read permission. I think we need ptrace-may-attach test here
>> instead of these capabilities (if I can attach to a process I can
>> read any data needed, including the content of the mapped files, if
>> only I'm not missing something obvious).
>
> Currently /proc/pid/map_files/* have exactly the same permission
> checks as /proc/pid/fd/*, with the exception of the extra
> CAP_SYS_ADMIN check. The check originated from the following
> discussions where 3 security issues are discussed:
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/02524.html
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/04030.html
>
> From what I understand, the extra CAP_SYS_ADMIN comes from the
> following issues:

> 1. Being able to open dma-buf / kdbus region (referred in the
> referenced email as problem #1). I don't fully understand what the
> dangers are, but perhaps we could do CAP_SYS_ADMIN check only for such
> dangerous files, as opposed to all files.

I don't know precisely the concern but my memory is that some drivers do
interesting things when mmaped. Possibly even to changing the vm_file.

I think that is worth running to the ground and figuring out in the
context of checkpoint/restart because the ordinary checkpoint/restart
code won't be able deal with them either.

So I vote for figuring that case out and dealing with it.


> 2. /proc/pid/fd/* is already a security hole (Andy says "I hope to fix
> that some day"). He essentially says that it's not because fds are
> insecure that map_files should be too. He seems to claim that mapped
> files that are then closed seems to be a bigger concern than other
> opened files. However, in the present time (5 years after these email
> conversations), the fd directory does not have the CAP_SYS_ADMIN check
> which doesn't convinces me that the holes of /proc/pid/fd/* are such a
> big of a deal. I'm not entirely sure what security issue Andy refers
> to, but, I understand something along the lines of: Some process gets
> an fd of a file read-only opened (via a unix socket for example, or
> after a chroot), and gets to re-open the file in write access via
> /proc/self/fd/N to do some damage.

I would hope the other permission checks on such a file will prevent
some of that nonsense. But definitely worth taking a hard look at.

> 3. Being able to ftruncate a file after a chroot+privilege drop. I may
> be wrong, but if privileges were dropped, then there's no reason that
> the then unprivileged user would have write access to the mmaped file
> inode. Seems a false problem.

Yes.

> It turns out that some of these concerns have been addressed with the
> introduction of memfd with seals, introduced around the same time
> where the map_files discussions took place. These seals allow one to
> share write access of an mmap region to an unsecure program, without
> fearing of getting a SIGBUS because the unsecure program could call
> ftruncate() on the fd. More on that at
> https://lwn.net/Articles/593918/ . Also, that article says "There are
> a number of fairly immediate use cases for the sealing functionality
> in general. Graphics drivers could use it to safely receive buffers
> from applications. The upcoming kdbus transport can benefit from
> sealing.". This rings a bell with problem #1. Perhaps memfd is a
> solution to Andy's concerns?

> Overall, I think the CAP_SYS_ADMIN map_files/ extra check compared to
> fd/ does not improve security in practice. Fds will be given to
> insecure programs. Better security can be achieved with memfd seals,
> and sane permissioning on files, regardless if they were once closed.

I would love to see the work put in to safely relax the permission check
from capable to ns_capable. Which is just dealing with point 1.

There might be some other assumptions that a process can't get at mmaped
regions.

Eric



2020-06-09 21:30:51

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Tue, Jun 09, 2020 at 08:09:49PM +0000, Nicolas Viennot wrote:
> >> proc_map_files_get_link(struct dentry *dentry,
> >> struct inode *inode,
> >> struct delayed_call *done)
> >> {
> >> - if (!capable(CAP_SYS_ADMIN))
> >> + if (!(capable(CAP_SYS_ADMIN) || capable(CAP_CHECKPOINT_RESTORE)))
> >> return ERR_PTR(-EPERM);
>
> > First of all -- sorry for late reply. You know, looking into this code more I think
> this CAP_SYS_ADMIN is simply wrong: for example I can't even fetch links for /proc/self/map_files.
> Still /proc/$pid/maps (which as well points to the files opened) test for ptrace-read permission.
> I think we need ptrace-may-attach test here instead of these capabilities (if I can attach to
> a process I can read any data needed, including the content of the mapped files, if only
> I'm not missing something obvious).
>

Nikolas, could you please split the text lines next time, I've had to add newlines into reply manually :)

> Currently /proc/pid/map_files/* have exactly the same permission checks as /proc/pid/fd/*, with the exception
> of the extra CAP_SYS_ADMIN check. The check originated from the following discussions where 3 security issues are discussed:
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/02524.html
> http://lkml.iu.edu/hypermail/linux/kernel/1505.2/04030.html
>
> From what I understand, the extra CAP_SYS_ADMIN comes from the following issues:
> 1. Being able to open dma-buf / kdbus region (referred in the referenced email as problem #1).
> I don't fully understand what the dangers are, but perhaps we could do CAP_SYS_ADMIN check
> only for such dangerous files, as opposed to all files.

As far as I remember we only need to read the content of mmap'ed files and if I've ptrace-attach
permission we aready can inject own code into a process and read anything we wish. That said we probably
should fixup this interface like -- test for open mode and if it is read only then ptrace-attach
should be enough, if it is write mode -- then we require being node's admin instead of just adding
a new capability here. And thanks a huge for mail reference, I'll take a look once time permit.

2020-06-10 08:02:18

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Tue, Jun 09, 2020 at 06:14:27PM +0200, Christian Brauner wrote:
> On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
> > On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> > > On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
...
> > > > PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> > > > CAP_SYS_ADMIN too.
> > >
> > > This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> > > safe to allow unprivileged users to suspend security policies? That
> > > sounds like a bad idea.
> >
...
> > I don't suggest to remove or
> > downgrade this capability check. The patch allows all c/r related
> > operations if the current has CAP_CHECKPOINT_RESTORE.
> >
> > So in this case the check:
> > if (!capable(CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > will be converted in:
> > if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
> > return -EPERM;
>
> Yeah, I got that but what's the goal here? Isn't it that you want to
> make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
> fscap set so that unprivileged users can restore their own processes
> without creating a new user namespace or am I missing something? The
> use-cases in the cover-letter make it sound like that's what this is
> leading up to:
> > > > > * Checkpoint/Restore in an HPC environment in combination with a resource
> > > > > manager distributing jobs where users are always running as non-root.
> > > > > There is a desire to provide a way to checkpoint and restore long running
> > > > > jobs.
> > > > > * Container migration as non-root
> > > > > * We have been in contact with JVM developers who are integrating
> > > > > CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> > > > > applications are not meant to be running with CAP_SYS_ADMIN.
>
> But maybe I'm just misunderstanding crucial bits (likely (TM)).

I think you understand this right. The goal is to make it possible to
use C/R functionality for unprivileged processes. And for me, here are
two separate tasks. The first one is how to allow unprivileged users to
use C/R from the root user namespace. This is what we discuss here.

And another one is how to allow to use C/R functionality from a non-root
user namespaces. The second task is about downgrading capable to
ns_capable for map_files and PTRACE_O_SUSPEND_SECCOMP.

Thanks,
Andrei

2020-06-10 20:01:55

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Wed, Jun 10, 2020 at 08:41:29AM -0700, Casey Schaufler wrote:
>
> On 6/10/2020 12:59 AM, Andrei Vagin wrote:
> > On Tue, Jun 09, 2020 at 06:14:27PM +0200, Christian Brauner wrote:
> >> On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
> >>> On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
> >>>> On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> > ...
> >>>>> PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
> >>>>> CAP_SYS_ADMIN too.
> >>>> This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
> >>>> safe to allow unprivileged users to suspend security policies? That
> >>>> sounds like a bad idea.
> > ...
> >>> I don't suggest to remove or
> >>> downgrade this capability check. The patch allows all c/r related
> >>> operations if the current has CAP_CHECKPOINT_RESTORE.
> >>>
> >>> So in this case the check:
> >>> if (!capable(CAP_SYS_ADMIN))
> >>> return -EPERM;
> >>>
> >>> will be converted in:
> >>> if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
> >>> return -EPERM;
> >> Yeah, I got that but what's the goal here? Isn't it that you want to
> >> make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
> >> fscap set so that unprivileged users can restore their own processes
> >> without creating a new user namespace or am I missing something? The
> >> use-cases in the cover-letter make it sound like that's what this is
> >> leading up to:
> >>>>>> * Checkpoint/Restore in an HPC environment in combination with a resource
> >>>>>> manager distributing jobs where users are always running as non-root.
> >>>>>> There is a desire to provide a way to checkpoint and restore long running
> >>>>>> jobs.
> >>>>>> * Container migration as non-root
> >>>>>> * We have been in contact with JVM developers who are integrating
> >>>>>> CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> >>>>>> applications are not meant to be running with CAP_SYS_ADMIN.
> >> But maybe I'm just misunderstanding crucial bits (likely (TM)).
> > I think you understand this right. The goal is to make it possible to
> > use C/R functionality for unprivileged processes.
>
> Y'all keep saying "unprivileged processes" when you mean
> "processes with less than root privilege". A process with
> CAP_CHECKPOINT_RESTORE *is* a privileged process. It would

That was me being imprecise. What I mean is "unprivileged user"
not "unprivileged process". It makes me a little uneasy that an
unprivileged _user_ can call the criu binary with the
CAP_CHECKPOINT_RESTORE fscap set and suspend seccomp of a process (Which
is what my original question here was about). Maybe this is paranoia but
shouldn't suspending _security_ mechanisms be kept either under
CAP_SYS_ADMIN or CAP_MAC_ADMIN?

Christian

2020-06-10 20:02:08

by Casey Schaufler

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE


On 6/10/2020 12:59 AM, Andrei Vagin wrote:
> On Tue, Jun 09, 2020 at 06:14:27PM +0200, Christian Brauner wrote:
>> On Tue, Jun 09, 2020 at 09:06:27AM -0700, Andrei Vagin wrote:
>>> On Tue, Jun 09, 2020 at 09:44:22AM +0200, Christian Brauner wrote:
>>>> On Mon, Jun 08, 2020 at 08:42:21PM -0700, Andrei Vagin wrote:
> ...
>>>>> PTRACE_O_SUSPEND_SECCOMP is needed for C/R and it is protected by
>>>>> CAP_SYS_ADMIN too.
>>>> This is currently capable(CAP_SYS_ADMIN) (init_ns capable) why is it
>>>> safe to allow unprivileged users to suspend security policies? That
>>>> sounds like a bad idea.
> ...
>>> I don't suggest to remove or
>>> downgrade this capability check. The patch allows all c/r related
>>> operations if the current has CAP_CHECKPOINT_RESTORE.
>>>
>>> So in this case the check:
>>> if (!capable(CAP_SYS_ADMIN))
>>> return -EPERM;
>>>
>>> will be converted in:
>>> if (!capable(CAP_SYS_ADMIN) && !capable(CAP_CHECKPOINT_RESTORE))
>>> return -EPERM;
>> Yeah, I got that but what's the goal here? Isn't it that you want to
>> make it safe to install the criu binary with the CAP_CHECKPOINT_RESTORE
>> fscap set so that unprivileged users can restore their own processes
>> without creating a new user namespace or am I missing something? The
>> use-cases in the cover-letter make it sound like that's what this is
>> leading up to:
>>>>>> * Checkpoint/Restore in an HPC environment in combination with a resource
>>>>>> manager distributing jobs where users are always running as non-root.
>>>>>> There is a desire to provide a way to checkpoint and restore long running
>>>>>> jobs.
>>>>>> * Container migration as non-root
>>>>>> * We have been in contact with JVM developers who are integrating
>>>>>> CRIU into a Java VM to decrease the startup time. These checkpoint/restore
>>>>>> applications are not meant to be running with CAP_SYS_ADMIN.
>> But maybe I'm just misunderstanding crucial bits (likely (TM)).
> I think you understand this right. The goal is to make it possible to
> use C/R functionality for unprivileged processes.

Y'all keep saying "unprivileged processes" when you mean
"processes with less than root privilege". A process with
CAP_CHECKPOINT_RESTORE *is* a privileged process. It would
have different privilege from a process with CAP_SYS_ADMIN
(the current case) but is not "unprivileged".

> And for me, here are
> two separate tasks. The first one is how to allow unprivileged users to
> use C/R from the root user namespace. This is what we discuss here.
>
> And another one is how to allow to use C/R functionality from a non-root
> user namespaces. The second task is about downgrading capable to
> ns_capable for map_files and PTRACE_O_SUSPEND_SECCOMP.
>
> Thanks,
> Andrei