LinuxLists.cc - [PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT

2020-07-19 10:06:15

Subject: [PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

This is v6 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
changes to v5 are:

* split patch dealing with /proc/self/exe into two patches:
  * first patch to enable changing it with CAP_CHECKPOINT_RESTORE
    and detailed history in the commit message
  * second patch changes -EINVAL to -EPERM
* use kselftest_harness.h infrastructure for test
* replace if (!capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE))
  with if (!checkpoint_restore_ns_capable(&init_user_ns))

Adrian Reber (5):
capabilities: Introduce CAP_CHECKPOINT_RESTORE
pid: use checkpoint_restore_ns_capable() for set_tid
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
proc: allow access in init userns for map_files with
CAP_CHECKPOINT_RESTORE
selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (2):
prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
prctl: exe link permission error changed from -EINVAL to -EPERM

fs/proc/base.c | 8 +-
include/linux/capability.h | 6 +
include/uapi/linux/capability.h | 9 +-
kernel/pid.c | 2 +-
kernel/pid_namespace.c | 2 +-
kernel/sys.c | 13 +-
security/selinux/include/classmap.h | 5 +-
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 4 +-
.../clone3/clone3_cap_checkpoint_restore.c | 177 ++++++++++++++++++
10 files changed, 212 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
--
2.26.2

2020-07-19 10:06:26

by Adrian Reber

Subject: [PATCH v6 1/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last few years, the CRIU (Checkpoint/Restore In Userspace) team has
been asked numerous times if it is possible to checkpoint/restore a
process as non-root. The answer usually was: 'almost'.

The main blocker to restoring a process as non-root was controlling the PID
of the restored process. This feature, available via the clone3 system
call or via /proc/sys/kernel/ns_last_pid, is unfortunately guarded by
CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have
increased due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a
resource manager distributing jobs where users are always running as
non-root. There is a desire to provide a way to checkpoint and
restore long running jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
CRIU into a Java VM to decrease the startup time. These
checkpoint/restore applications are not meant to be running with
CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
Unfortunately, this helper delegation technique is impossible to use
with clone3, and is thus prone to races.
See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
This has been demonstrated to work with cycling rates of 100,000 PIDs/s
See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited
use in typical container environments (e.g., Kubernetes) as /proc is
typically protected with read-only layers (e.g., /proc/sys) for
hardening purposes. Read-only layers prevent additional /proc mounts
(due to proc's SB_I_USERNS_VISIBLE property), making the use of new
PID namespaces limited as certain applications need access to /proc
matching their PID namespace.

The introduced capability allows a process to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for
recovering files that are unreachable via the file system such as
deleted files, or memfd files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Nicolas Viennot <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
---
include/linux/capability.h | 6 ++++++
include/uapi/linux/capability.h | 9 ++++++++-
security/selinux/include/classmap.h | 5 +++--
3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
}

+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+ return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+ ns_capable(ns, CAP_SYS_ADMIN);
+}
+
/* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
*/
#define CAP_BPF 39

-#define CAP_LAST_CAP CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE 40
+
+#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE

#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)

diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index e54d62d529f1..ba2e01a6955c 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
"audit_control", "setfcap"

#define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
- "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+ "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+ "checkpoint_restore"

-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
#error New capability defined, please update COMMON_CAP2_PERMS.
#endif

--
2.26.2
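
The helper added in this patch reduces to an OR of two capability bits. As a
minimal userspace sketch of the same predicate, assume the effective set is
available as a hex string in the style of the CapEff line in
/proc/<pid>/status. The function and macro names below are illustrative and
not part of the patch, and unlike ns_capable() the sketch ignores user
namespaces entirely:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Capability numbers as defined in include/uapi/linux/capability.h. */
#define SKETCH_CAP_SYS_ADMIN          21
#define SKETCH_CAP_CHECKPOINT_RESTORE 40 /* introduced by this patch */

/* Parse a hexadecimal capability mask such as "0000010000000000". */
uint64_t parse_cap_mask(const char *hex)
{
	return strtoull(hex, NULL, 16);
}

/* True if capability number `cap` (0..63) is set in the 64-bit mask. */
bool mask_has_cap(uint64_t mask, unsigned int cap)
{
	return (mask >> cap) & 1;
}

/*
 * Userspace analogue of checkpoint_restore_ns_capable(): either the new
 * capability or CAP_SYS_ADMIN is sufficient.
 */
bool mask_checkpoint_restore_capable(uint64_t mask)
{
	return mask_has_cap(mask, SKETCH_CAP_CHECKPOINT_RESTORE) ||
	       mask_has_cap(mask, SKETCH_CAP_SYS_ADMIN);
}
```

Keeping CAP_SYS_ADMIN in the OR is what preserves the behaviour existing
privileged callers rely on.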

2020-07-19 10:06:39

by Adrian Reber

Subject: [PATCH v6 2/7] pid: use checkpoint_restore_ns_capable() for set_tid

Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.

Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Nicolas Viennot <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
---
kernel/pid.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index de9d29c41d77..a9cbab0194d9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
if (tid != 1 && !tmp->child_reaper)
goto out_free;
retval = -EPERM;
- if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+ if (!checkpoint_restore_ns_capable(tmp->user_ns))
goto out_free;
set_tid_size--;
}
--
2.26.2
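
From userspace, the check changed above is reached by calling clone3() with
set_tid populated. The following is only a sketch under stated assumptions:
the struct is a local mirror of the kernel's struct clone_args up to
set_tid_size (so that it builds against older headers), the fallback syscall
number 435 is assumed to match the target architecture, and clone3_with_pid()
is a hypothetical helper name. It returns 0 if the child got the requested
PID, or a negative errno (for example -EPERM without the capability):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435 /* assumption: generic syscall number */
#endif

/* Local mirror of struct clone_args through set_tid_size (80 bytes). */
struct clone3_args_sketch {
	uint64_t flags, pidfd, child_tid, parent_tid, exit_signal;
	uint64_t stack, stack_size, tls, set_tid, set_tid_size;
};

/* Fork-like clone3() that asks the kernel for PID `wanted`. */
int clone3_with_pid(pid_t wanted)
{
	pid_t set_tid[1] = { wanted };
	struct clone3_args_sketch args = {
		.exit_signal = SIGCHLD,
		.set_tid = (uint64_t)(uintptr_t)set_tid,
		.set_tid_size = 1,
	};
	int status;
	long pid;

	pid = syscall(__NR_clone3, &args, sizeof(args));
	if (pid < 0)
		return -errno;

	/* Child: report whether the PID matches, then exit immediately. */
	if (pid == 0)
		_exit(getpid() == wanted ? 0 : 1);

	if (waitpid((pid_t)pid, &status, 0) < 0)
		return -errno;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -ECHILD;
}
```

A caller would typically pick a PID known to be free, as the selftest in this
series does by forking and reaping a throwaway child first.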

2020-07-19 10:06:56

by Adrian Reber

Subject: [PATCH v6 3/7] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid

Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
writing to ns_last_pid.

Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Nicolas Viennot <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Serge Hallyn <[email protected]>
---
kernel/pid_namespace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
struct ctl_table tmp = *table;
int ret, next;

- if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+ if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
return -EPERM;

/*
--
2.26.2
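
For readers unfamiliar with the ns_last_pid interface guarded above: a
restorer writes the PID preceding the one it wants and then forks, since the
kernel hands out the next PID after the recorded one. A minimal sketch,
assuming a hypothetical helper write_last_pid() that takes the file path as a
parameter (normally /proc/sys/kernel/ns_last_pid); it does nothing about the
race between the write and the subsequent fork:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `last_pid` in decimal to the file at `path`.
 * Returns 0 on success or a negative errno value.
 */
int write_last_pid(const char *path, pid_t last_pid)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%d", (int)last_pid);
	ssize_t written;
	int ret = 0;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno; /* e.g. -EACCES or -ENOENT */

	written = write(fd, buf, len);
	if (written < 0)
		ret = -errno; /* -EPERM when the capability check fails */
	else if (written != len)
		ret = -EIO; /* short write */

	close(fd);
	return ret;
}
```

To aim for PID 12303, for example, a caller would pass 12302 and fork right
afterwards.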

2020-07-19 10:08:02

by Adrian Reber

Subject: [PATCH v6 6/7] prctl: exe link permission error changed from -EINVAL to -EPERM

From: Nicolas Viennot <[email protected]>

This brings consistency with the rest of the prctl() syscall where
-EPERM is returned when failing a capability check.

Signed-off-by: Nicolas Viennot <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
---
kernel/sys.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index a3f4ef0bbda3..ca11af9d815d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2015,7 +2015,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
* This may have implications in the tomoyo subsystem.
*/
if (!checkpoint_restore_ns_capable(current_user_ns()))
- return -EINVAL;
+ return -EPERM;

error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
--
2.26.2

2020-07-19 10:08:18

by Adrian Reber

Subject: [PATCH v6 7/7] selftests: add clone3() CAP_CHECKPOINT_RESTORE test

This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.

Signed-off-by: Adrian Reber <[email protected]>
---
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 4 +-
.../clone3/clone3_cap_checkpoint_restore.c | 177 ++++++++++++++++++
3 files changed, 181 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c

diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore
index a81085742d40..83c0f6246055 100644
--- a/tools/testing/selftests/clone3/.gitignore
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -2,3 +2,4 @@
clone3
clone3_clear_sighand
clone3_set_tid
+clone3_cap_checkpoint_restore
diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap

-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+ clone3_cap_checkpoint_restore

include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index 000000000000..c0d83511cd28
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest_harness.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+ fflush(stdout);
+ fflush(stderr);
+ _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
+{
+ int status;
+ pid_t pid = -1;
+
+ struct clone_args args = {
+ .exit_signal = SIGCHLD,
+ .set_tid = ptr_to_u64(set_tid),
+ .set_tid_size = set_tid_size,
+ };
+
+ pid = sys_clone3(&args, sizeof(struct clone_args));
+ if (pid < 0) {
+ ksft_print_msg("%s - Failed to create new process\n", strerror(errno));
+ return -errno;
+ }
+
+ if (pid == 0) {
+ int ret;
+ char tmp = 0;
+
+ ksft_print_msg
+ ("I am the child, my PID is %d (expected %d)\n", getpid(), set_tid[0]);
+
+ if (set_tid[0] != getpid())
+ child_exit(EXIT_FAILURE);
+ child_exit(EXIT_SUCCESS);
+ }
+
+ ksft_print_msg("I am the parent (%d). My child's pid is %d\n", getpid(), pid);
+
+ if (waitpid(pid, &status, 0) < 0) {
+ ksft_print_msg("Child returned %s\n", strerror(errno));
+ return -errno;
+ }
+
+ if (!WIFEXITED(status))
+ return -1;
+
+ return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
+{
+ int ret;
+
+ ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n", getpid(), set_tid[0]);
+ ret = call_clone3_set_tid(set_tid, set_tid_size);
+ ksft_print_msg("[%d] clone3() with CLONE_SET_TID %d says:%d\n", getpid(), set_tid[0], ret);
+ return ret;
+}
+
+struct libcap {
+ struct __user_cap_header_struct hdr;
+ struct __user_cap_data_struct data[2];
+};
+
+static int set_capability(void)
+{
+ cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+ struct libcap *cap;
+ int ret = -1;
+ cap_t caps;
+
+ caps = cap_get_proc();
+ if (!caps) {
+ perror("cap_get_proc");
+ return -1;
+ }
+
+ /* Drop all capabilities */
+ if (cap_clear(caps)) {
+ perror("cap_clear");
+ goto out;
+ }
+
+ cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+ cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+ cap = (struct libcap *) caps;
+
+ /* 40 -> CAP_CHECKPOINT_RESTORE */
+ cap->data[1].effective |= 1 << (40 - 32);
+ cap->data[1].permitted |= 1 << (40 - 32);
+
+ if (cap_set_proc(caps)) {
+ perror("cap_set_proc");
+ goto out;
+ }
+ ret = 0;
+out:
+ if (cap_free(caps))
+ perror("cap_free");
+ return ret;
+}
+
+TEST(clone3_cap_checkpoint_restore)
+{
+ pid_t pid;
+ int status;
+ int ret = 0;
+ pid_t set_tid[1];
+
+ test_clone3_supported();
+
+ EXPECT_EQ(getuid(), 0)
+ SKIP(return, "Skipping all tests as non-root\n");
+
+ memset(&set_tid, 0, sizeof(set_tid));
+
+ /* Find the current active PID */
+ pid = fork();
+ if (pid == 0) {
+ TH_LOG("Child has PID %d", getpid());
+ child_exit(EXIT_SUCCESS);
+ }
+ ASSERT_GT(waitpid(pid, &status, 0), 0)
+ TH_LOG("Waiting for child %d failed", pid);
+
+ /* After the child has finished, its PID should be free. */
+ set_tid[0] = pid;
+
+ ASSERT_EQ(set_capability(), 0)
+ TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
+ prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
+ setgid(1000);
+ setuid(1000);
+ set_tid[0] = pid;
+ /* This would fail without CAP_CHECKPOINT_RESTORE */
+ ASSERT_EQ(test_clone3_set_tid(set_tid, 1), -EPERM);
+ ASSERT_EQ(set_capability(), 0)
+ TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
+ /* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
+ ASSERT_EQ(test_clone3_set_tid(set_tid, 1), 0);
+}
+
+TEST_HARNESS_MAIN
--
2.26.2
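
The selftest sets the new capability by writing bit (40 - 32) into data[1],
because the capability sets are stored as two 32-bit words. The arithmetic can
be sketched generically as below; the function names are illustrative, and the
kernel's uapi header provides CAP_TO_INDEX() and CAP_TO_MASK() macros for the
same purpose:

```c
#include <stdint.h>

/* Index of the 32-bit word that holds capability number `cap`. */
unsigned int cap_word_index(unsigned int cap)
{
	return cap >> 5; /* cap / 32 */
}

/*
 * Mask of `cap` within its 32-bit word. An unsigned constant is used so
 * that bit 31 does not shift into the sign bit.
 */
uint32_t cap_word_mask(unsigned int cap)
{
	return UINT32_C(1) << (cap & 31); /* cap % 32 */
}
```

For CAP_CHECKPOINT_RESTORE (40) this yields word 1 and mask 0x100, matching
the constants hard-coded in set_capability().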

2020-07-19 10:09:12

by Adrian Reber

Subject: [PATCH v6 4/7] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE

Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.

Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Nicolas Viennot <[email protected]>
Reviewed-by: Cyrill Gorcunov <[email protected]>
---
fs/proc/base.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 65893686d1f1..b824a8c89011 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2194,16 +2194,16 @@ struct map_files_info {
};

/*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
*/
static const char *
proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
- if (!capable(CAP_SYS_ADMIN))
+ if (!checkpoint_restore_ns_capable(&init_user_ns))
return ERR_PTR(-EPERM);

return proc_pid_get_link(dentry, inode, done);
--
2.26.2
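
As background for the check above: entries in /proc/<pid>/map_files are named
after the mapped address range, as a pair of hexadecimal addresses joined by a
dash. Listing and parsing the names needs no special privilege; it is
following the links that proc_map_files_get_link() guards. A small sketch of
the name parsing, with parse_map_files_name() being a hypothetical helper:

```c
#include <stdio.h>

/*
 * Parse a map_files entry name such as "400000-41a000" into its start
 * and end addresses. Returns 0 on success, -1 if the name is malformed.
 */
int parse_map_files_name(const char *name,
			 unsigned long *start, unsigned long *end)
{
	char trailing;

	/* A third conversion succeeding means garbage follows the range. */
	if (sscanf(name, "%lx-%lx%c", start, end, &trailing) != 2)
		return -1;

	/* A valid mapping always ends after it starts. */
	return *start < *end ? 0 : -1;
}
```

A checkpointing tool could use the parsed range to match an entry against the
corresponding line in /proc/<pid>/maps before attempting to open it.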

2020-07-19 10:09:30

by Adrian Reber

Subject: [PATCH v6 5/7] prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe

From: Nicolas Viennot <[email protected]>

Originally, only a local CAP_SYS_ADMIN could change the exe link,
making it difficult to do checkpoint/restore without CAP_SYS_ADMIN.
This commit adds CAP_CHECKPOINT_RESTORE in addition to CAP_SYS_ADMIN
for permitting changing the exe link.

The following describes the history of the /proc/self/exe permission
checks as it may be difficult to understand what decisions led to this
point.

* [1] May 2012: This commit introduces the ability of changing
/proc/self/exe if the user is CAP_SYS_RESOURCE capable.
In the related discussion [2], no clear threat model is presented for
what could happen if the /proc/self/exe changes multiple times, or why
the admin would be at the mercy of userspace.

* [3] Oct 2014: This commit introduces a new API to change
/proc/self/exe. The permission no longer checks for CAP_SYS_RESOURCE,
but instead checks if the current user is root (uid=0) in its local
namespace. In the related discussion [4] it is said that "Controlling
exe_fd without privileges may turn out to be dangerous. At least
things like tomoyo examine it for making policy decisions (see
tomoyo_manager())."

* [5] Dec 2016: This commit removes the restriction to change
/proc/self/exe at most once. The related discussion [6] informs that
the audit subsystem relies on the exe symlink, presumably
audit_log_d_path_exe() in kernel/audit.c.

* [7] May 2017: This commit changed the check from uid==0 to local
CAP_SYS_ADMIN. No discussion.

* [8] July 2020: A PoC to spoof any program's /proc/self/exe via ptrace
is demonstrated.

Overall, the concrete points that were made to retain capability checks
around changing the exe symlink are that tomoyo_manager() and
audit_log_d_path_exe() use the exe_file path.

Christian Brauner said that relying on /proc/<pid>/exe being immutable (or
guarded by caps) for the sake of security is a bit misleading. It can only
be used as a hint without any guarantees of what code is being executed
once execve() returns to userspace. Christian suggested that in the
future, we could call audit_log() or similar to inform the admin of all
exe link changes, instead of attempting to provide security guarantees
via permission checks. However, this proposed change requires an
understanding of the security implications in the tomoyo/audit subsystems.

[1] b32dfe377102 ("c/r: prctl: add ability to set new mm_struct::exe_file")
[2] https://lore.kernel.org/patchwork/patch/292515/
[3] f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
[4] https://lore.kernel.org/patchwork/patch/479359/
[5] 3fb4afd9a504 ("prctl: remove one-shot limitation for changing exe link")
[6] https://lore.kernel.org/patchwork/patch/697304/
[7] 4d28df6152aa ("prctl: Allow local CAP_SYS_ADMIN changing exe_file")
[8] https://github.com/nviennot/run_as_exe

Signed-off-by: Nicolas Viennot <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
---
kernel/sys.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..a3f4ef0bbda3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2007,11 +2007,14 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data

if (prctl_map.exe_fd != (u32)-1) {
/*
- * Make sure the caller has the rights to
- * change /proc/pid/exe link: only local sys admin should
- * be allowed to.
+ * Check if the current user is checkpoint/restore capable.
+ * At the time of this writing, it checks for CAP_SYS_ADMIN
+ * or CAP_CHECKPOINT_RESTORE.
+ * Note that a user with access to ptrace can masquerade an
+ * arbitrary program as any executable, even setuid ones.
+ * This may have implications in the tomoyo subsystem.
*/
- if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ if (!checkpoint_restore_ns_capable(current_user_ns()))
return -EINVAL;

error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
--
2.26.2

2020-07-19 16:51:44

by Serge E. Hallyn

Subject: Re: [PATCH v6 4/7] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE

On Sun, Jul 19, 2020 at 12:04:14PM +0200, Adrian Reber wrote:
> Opening files in /proc/pid/map_files when the current user is
> CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
> checkpointing and restoring to recover files that are unreachable via
> the file system such as deleted files, or memfd files.
>
> Signed-off-by: Adrian Reber <[email protected]>

Reviewed-by: Serge Hallyn <[email protected]>

> Signed-off-by: Nicolas Viennot <[email protected]>
> Reviewed-by: Cyrill Gorcunov <[email protected]>
> ---
> fs/proc/base.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 65893686d1f1..b824a8c89011 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2194,16 +2194,16 @@ struct map_files_info {
> };
>
> /*
> - * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> - * symlinks may be used to bypass permissions on ancestor directories in the
> - * path to the file in question.
> + * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
> + * to concerns about how the symlinks may be used to bypass permissions on
> + * ancestor directories in the path to the file in question.
> */
> static const char *
> proc_map_files_get_link(struct dentry *dentry,
> struct inode *inode,
> struct delayed_call *done)
> {
> - if (!capable(CAP_SYS_ADMIN))
> + if (!checkpoint_restore_ns_capable(&init_user_ns))
> return ERR_PTR(-EPERM);
>
> return proc_pid_get_link(dentry, inode, done);
> --
> 2.26.2

2020-07-19 17:07:21

by Serge E. Hallyn

Subject: Re: [PATCH v6 6/7] prctl: exe link permission error changed from -EINVAL to -EPERM

On Sun, Jul 19, 2020 at 12:04:16PM +0200, Adrian Reber wrote:
> From: Nicolas Viennot <[email protected]>
>
> This brings consistency with the rest of the prctl() syscall where
> -EPERM is returned when failing a capability check.
>
> Signed-off-by: Nicolas Viennot <[email protected]>
> Signed-off-by: Adrian Reber <[email protected]>

Ok, I see how EINVAL snuck its way in there through validate_prctl_map()'s
evolution :)

Reviewed-by: Serge Hallyn <[email protected]>

> ---
> kernel/sys.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index a3f4ef0bbda3..ca11af9d815d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2015,7 +2015,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> * This may have implications in the tomoyo subsystem.
> */
> if (!checkpoint_restore_ns_capable(current_user_ns()))
> - return -EINVAL;
> + return -EPERM;
>
> error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
> if (error)
> --
> 2.26.2

2020-07-19 18:20:23

by Christian Brauner

Subject: Re: [PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Sun, Jul 19, 2020 at 12:04:10PM +0200, Adrian Reber wrote:
> This is v6 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
> changes to v5 are:
>
> * split patch dealing with /proc/self/exe into two patches:
> * first patch to enable changing it with CAP_CHECKPOINT_RESTORE
> and detailed history in the commit message
> * second patch changes -EINVAL to -EPERM
> * use kselftest_harness.h infrastructure for test
> * replace if (!capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE))
> with if (!checkpoint_restore_ns_capable(&init_user_ns))
>
> Adrian Reber (5):
> capabilities: Introduce CAP_CHECKPOINT_RESTORE
> pid: use checkpoint_restore_ns_capable() for set_tid
> pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
> proc: allow access in init userns for map_files with
> CAP_CHECKPOINT_RESTORE
> selftests: add clone3() CAP_CHECKPOINT_RESTORE test
>
> Nicolas Viennot (2):
> prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
> prctl: exe link permission error changed from -EINVAL to -EPERM
>
> fs/proc/base.c | 8 +-
> include/linux/capability.h | 6 +
> include/uapi/linux/capability.h | 9 +-
> kernel/pid.c | 2 +-
> kernel/pid_namespace.c | 2 +-
> kernel/sys.c | 13 +-
> security/selinux/include/classmap.h | 5 +-
> tools/testing/selftests/clone3/.gitignore | 1 +
> tools/testing/selftests/clone3/Makefile | 4 +-
> .../clone3/clone3_cap_checkpoint_restore.c | 177 ++++++++++++++++++
> 10 files changed, 212 insertions(+), 15 deletions(-)
> create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
>
> base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822

Adrian, Nicolas thank you!
I grabbed the series to run the various core test-suites we've added
over the last year and pushed it to
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=cap_checkpoint_restore
for now to let kbuild/ltp chew on it for a bit.

Thanks!
Christian

2020-07-20 11:58:10

by Christian Brauner

Subject: Re: [PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Sun, Jul 19, 2020 at 08:17:30PM +0200, Christian Brauner wrote:
> On Sun, Jul 19, 2020 at 12:04:10PM +0200, Adrian Reber wrote:
> > This is v6 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
> > changes to v5 are:
> >
> > * split patch dealing with /proc/self/exe into two patches:
> > * first patch to enable changing it with CAP_CHECKPOINT_RESTORE
> > and detailed history in the commit message
> > * second patch changes -EINVAL to -EPERM
> > * use kselftest_harness.h infrastructure for test
> > * replace if (!capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE))
> > with if (!checkpoint_restore_ns_capable(&init_user_ns))
> >
> > Adrian Reber (5):
> > capabilities: Introduce CAP_CHECKPOINT_RESTORE
> > pid: use checkpoint_restore_ns_capable() for set_tid
> > pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
> > proc: allow access in init userns for map_files with
> > CAP_CHECKPOINT_RESTORE
> > selftests: add clone3() CAP_CHECKPOINT_RESTORE test
> >
> > Nicolas Viennot (2):
> > prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
> > prctl: exe link permission error changed from -EINVAL to -EPERM
> >
> > fs/proc/base.c | 8 +-
> > include/linux/capability.h | 6 +
> > include/uapi/linux/capability.h | 9 +-
> > kernel/pid.c | 2 +-
> > kernel/pid_namespace.c | 2 +-
> > kernel/sys.c | 13 +-
> > security/selinux/include/classmap.h | 5 +-
> > tools/testing/selftests/clone3/.gitignore | 1 +
> > tools/testing/selftests/clone3/Makefile | 4 +-
> > .../clone3/clone3_cap_checkpoint_restore.c | 177 ++++++++++++++++++
> > 10 files changed, 212 insertions(+), 15 deletions(-)
> > create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> >
> > base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
>
> Adrian, Nicolas thank you!
> I grabbed the series to run the various core test-suites we've added
> over the last year and pushed it to
> https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=cap_checkpoint_restore
> for now to let kbuild/ltp chew on it for a bit.

Ok, I ran the test-suite this morning and there's nothing to worry about;
it all passes. _But_ the selftests had a bug, using SKIP() instead of
XFAIL(), and they mixed ksft_print_msg() and TH_LOG(). I think I
mentioned to you that you can't use TH_LOG() outside of TEST*().
Turns out I was wrong. You can do it if you pass in a specific global
variable. Here's the diff I applied on top of the selftests you sent.
After these changes the output looks like this:

[==========] Running 1 tests from 1 test cases.
[ RUN ] global.clone3_cap_checkpoint_restore
# clone3() syscall supported
clone3_cap_checkpoint_restore.c:155:clone3_cap_checkpoint_restore:Child has PID 12303
clone3_cap_checkpoint_restore.c:88:clone3_cap_checkpoint_restore:[12302] Trying clone3() with CLONE_SET_TID to 12303
clone3_cap_checkpoint_restore.c:55:clone3_cap_checkpoint_restore:Operation not permitted - Failed to create new process
clone3_cap_checkpoint_restore.c:90:clone3_cap_checkpoint_restore:[12302] clone3() with CLONE_SET_TID 12303 says:-1
clone3_cap_checkpoint_restore.c:88:clone3_cap_checkpoint_restore:[12302] Trying clone3() with CLONE_SET_TID to 12303
clone3_cap_checkpoint_restore.c:70:clone3_cap_checkpoint_restore:I am the parent (12302). My child's pid is 12303
clone3_cap_checkpoint_restore.c:63:clone3_cap_checkpoint_restore:I am the child, my PID is 12303 (expected 12303)
clone3_cap_checkpoint_restore.c:90:clone3_cap_checkpoint_restore:[12302] clone3() with CLONE_SET_TID 12303 says:0
[ OK ] global.clone3_cap_checkpoint_restore
[==========] 1 / 1 tests passed.
[ PASSED ]

Ok with this below being applied on top of it?

diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
index c0d83511cd28..9562425aa0a9 100644
--- a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -38,7 +38,8 @@ static void child_exit(int ret)
 	_exit(ret);
 }

-static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
+static int call_clone3_set_tid(struct __test_metadata *_metadata,
+			       pid_t *set_tid, size_t set_tid_size)
 {
 	int status;
 	pid_t pid = -1;
@@ -51,7 +52,7 @@ static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)

 	pid = sys_clone3(&args, sizeof(struct clone_args));
 	if (pid < 0) {
-		ksft_print_msg("%s - Failed to create new process\n", strerror(errno));
+		TH_LOG("%s - Failed to create new process", strerror(errno));
 		return -errno;
 	}

@@ -59,18 +60,17 @@ static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
 		int ret;
 		char tmp = 0;

-		ksft_print_msg
-		    ("I am the child, my PID is %d (expected %d)\n", getpid(), set_tid[0]);
+		TH_LOG("I am the child, my PID is %d (expected %d)", getpid(), set_tid[0]);

 		if (set_tid[0] != getpid())
 			child_exit(EXIT_FAILURE);
 		child_exit(EXIT_SUCCESS);
 	}

-	ksft_print_msg("I am the parent (%d). My child's pid is %d\n", getpid(), pid);
+	TH_LOG("I am the parent (%d). My child's pid is %d", getpid(), pid);

 	if (waitpid(pid, &status, 0) < 0) {
-		ksft_print_msg("Child returned %s\n", strerror(errno));
+		TH_LOG("Child returned %s", strerror(errno));
 		return -errno;
 	}

@@ -80,13 +80,14 @@ static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
 	return WEXITSTATUS(status);
 }

-static int test_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
+static int test_clone3_set_tid(struct __test_metadata *_metadata,
+			       pid_t *set_tid, size_t set_tid_size)
 {
 	int ret;

-	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n", getpid(), set_tid[0]);
-	ret = call_clone3_set_tid(set_tid, set_tid_size);
-	ksft_print_msg("[%d] clone3() with CLONE_SET_TID %d says:%d\n", getpid(), set_tid[0], ret);
+	TH_LOG("[%d] Trying clone3() with CLONE_SET_TID to %d", getpid(), set_tid[0]);
+	ret = call_clone3_set_tid(_metadata, set_tid, set_tid_size);
+	TH_LOG("[%d] clone3() with CLONE_SET_TID %d says:%d", getpid(), set_tid[0], ret);
 	return ret;
 }

@@ -144,7 +145,7 @@ TEST(clone3_cap_checkpoint_restore)
 	test_clone3_supported();

 	EXPECT_EQ(getuid(), 0)
-		SKIP(return, "Skipping all tests as non-root\n");
+		XFAIL(return, "Skipping all tests as non-root\n");

 	memset(&set_tid, 0, sizeof(set_tid));

@@ -162,16 +163,20 @@ TEST(clone3_cap_checkpoint_restore)

 	ASSERT_EQ(set_capability(), 0)
 		TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
-	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
-	setgid(1000);
-	setuid(1000);
+
+	ASSERT_EQ(prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0), 0);
+
+	EXPECT_EQ(setgid(65534), 0)
+		TH_LOG("Failed to setgid(65534)");
+	ASSERT_EQ(setuid(65534), 0);
+
 	set_tid[0] = pid;
 	/* This would fail without CAP_CHECKPOINT_RESTORE */
-	ASSERT_EQ(test_clone3_set_tid(set_tid, 1), -EPERM);
+	ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), -EPERM);
 	ASSERT_EQ(set_capability(), 0)
 		TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
 	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
-	ASSERT_EQ(test_clone3_set_tid(set_tid, 1), 0);
+	ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), 0);
 }

 TEST_HARNESS_MAIN
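
The diff above threads a struct __test_metadata *_metadata argument through
the helper functions so that TH_LOG() can be called from them. Judging from
that change, the macro appears to depend on a variable named _metadata being
visible at the place where it is expanded. The following is a minimal
standalone sketch of that pattern, not the kselftest harness itself:
struct demo_metadata, DEMO_LOG(), demo_helper() and demo_test() are
hypothetical names made up for illustration, and the macro assumes the GNU
##__VA_ARGS__ extension (gcc or clang).

```c
#include <stdio.h>

/* Hypothetical stand-in for the harness's per-test metadata. */
struct demo_metadata {
	const char *name;	/* test name printed in every log line */
	int log_count;		/* number of lines logged so far */
};

/*
 * Hypothetical logging macro. It refers to a variable called _metadata
 * that must be in scope wherever the macro is expanded, which is the
 * assumption this sketch makes about how a harness log macro behaves.
 */
#define DEMO_LOG(fmt, ...)						\
	do {								\
		_metadata->log_count++;					\
		fprintf(stderr, "%s:%d:%s:" fmt "\n", __FILE__,	\
			__LINE__, _metadata->name, ##__VA_ARGS__);	\
	} while (0)

/* Helper outside the test body: it receives _metadata as a parameter. */
static int demo_helper(struct demo_metadata *_metadata, int value)
{
	DEMO_LOG("helper called with %d", value);
	return value * 2;
}

/* Stand-in for a test body, which already has _metadata available. */
static int demo_test(struct demo_metadata *_metadata)
{
	DEMO_LOG("test body starts");
	return demo_helper(_metadata, 21);
}
```

Calling demo_test() with a struct demo_metadata named "demo" writes two
lines to stderr, one from the test body and one from the helper, and both
carry the same test name because the helper logs through the pointer it
was handed.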

2020-07-20 12:47:41

by Adrian Reber

Subject: Re: [PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Mon, Jul 20, 2020 at 01:54:52PM +0200, Christian Brauner wrote:
> Ok, I ran the test-suite this morning and there's nothing to worry about:
> it all passes. _But_ the selftests had a bug, using SKIP() instead of
> XFAIL(), and they mixed ksft_print_msg() and TH_LOG(). I think I
> mentioned to you that you can't use TH_LOG() outside of TEST*().
> Turns out I was wrong. You can do it if you pass the test's _metadata
> variable into the helper functions. Here's the diff I applied on top of
> the selftests you sent. After these changes the output looks like this:
>
> [==========] Running 1 tests from 1 test cases.
> [ RUN ] global.clone3_cap_checkpoint_restore
> # clone3() syscall supported
> clone3_cap_checkpoint_restore.c:155:clone3_cap_checkpoint_restore:Child has PID 12303
> clone3_cap_checkpoint_restore.c:88:clone3_cap_checkpoint_restore:[12302] Trying clone3() with CLONE_SET_TID to 12303
> clone3_cap_checkpoint_restore.c:55:clone3_cap_checkpoint_restore:Operation not permitted - Failed to create new process
> clone3_cap_checkpoint_restore.c:90:clone3_cap_checkpoint_restore:[12302] clone3() with CLONE_SET_TID 12303 says:-1
> clone3_cap_checkpoint_restore.c:88:clone3_cap_checkpoint_restore:[12302] Trying clone3() with CLONE_SET_TID to 12303
> clone3_cap_checkpoint_restore.c:70:clone3_cap_checkpoint_restore:I am the parent (12302). My child's pid is 12303
> clone3_cap_checkpoint_restore.c:63:clone3_cap_checkpoint_restore:I am the child, my PID is 12303 (expected 12303)
> clone3_cap_checkpoint_restore.c:90:clone3_cap_checkpoint_restore:[12302] clone3() with CLONE_SET_TID 12303 says:0
> [ OK ] global.clone3_cap_checkpoint_restore
> [==========] 1 / 1 tests passed.
> [ PASSED ]
>
> Ok with this below being applied on top of it?
>
> diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> index c0d83511cd28..9562425aa0a9 100644
> --- a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> @@ -38,7 +38,8 @@ static void child_exit(int ret)
> _exit(ret);
> }
>
> -static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
> +static int call_clone3_set_tid(struct __test_metadata *_metadata,
> + pid_t *set_tid, size_t set_tid_size)
> {
> int status;
> pid_t pid = -1;
> @@ -51,7 +52,7 @@ static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
>
> pid = sys_clone3(&args, sizeof(struct clone_args));
> if (pid < 0) {
> - ksft_print_msg("%s - Failed to create new process\n", strerror(errno));
> + TH_LOG("%s - Failed to create new process", strerror(errno));
> return -errno;
> }
>
> @@ -59,18 +60,17 @@ static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
> int ret;
> char tmp = 0;
>
> - ksft_print_msg
> - ("I am the child, my PID is %d (expected %d)\n", getpid(), set_tid[0]);
> + TH_LOG("I am the child, my PID is %d (expected %d)", getpid(), set_tid[0]);
>
> if (set_tid[0] != getpid())
> child_exit(EXIT_FAILURE);
> child_exit(EXIT_SUCCESS);
> }
>
> - ksft_print_msg("I am the parent (%d). My child's pid is %d\n", getpid(), pid);
> + TH_LOG("I am the parent (%d). My child's pid is %d", getpid(), pid);
>
> if (waitpid(pid, &status, 0) < 0) {
> - ksft_print_msg("Child returned %s\n", strerror(errno));
> + TH_LOG("Child returned %s", strerror(errno));
> return -errno;
> }
>
> @@ -80,13 +80,14 @@ static int call_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
> return WEXITSTATUS(status);
> }
>
> -static int test_clone3_set_tid(pid_t *set_tid, size_t set_tid_size)
> +static int test_clone3_set_tid(struct __test_metadata *_metadata,
> + pid_t *set_tid, size_t set_tid_size)
> {
> int ret;
>
> - ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n", getpid(), set_tid[0]);
> - ret = call_clone3_set_tid(set_tid, set_tid_size);
> - ksft_print_msg("[%d] clone3() with CLONE_SET_TID %d says:%d\n", getpid(), set_tid[0], ret);
> + TH_LOG("[%d] Trying clone3() with CLONE_SET_TID to %d", getpid(), set_tid[0]);
> + ret = call_clone3_set_tid(_metadata, set_tid, set_tid_size);
> + TH_LOG("[%d] clone3() with CLONE_SET_TID %d says:%d", getpid(), set_tid[0], ret);
> return ret;
> }
>
> @@ -144,7 +145,7 @@ TEST(clone3_cap_checkpoint_restore)
> test_clone3_supported();
>
> EXPECT_EQ(getuid(), 0)
> - SKIP(return, "Skipping all tests as non-root\n");
> + XFAIL(return, "Skipping all tests as non-root\n");
>
> memset(&set_tid, 0, sizeof(set_tid));
>
> @@ -162,16 +163,20 @@ TEST(clone3_cap_checkpoint_restore)
>
> ASSERT_EQ(set_capability(), 0)
> TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
> - prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> - setgid(1000);
> - setuid(1000);
> +
> + ASSERT_EQ(prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0), 0);
> +
> + EXPECT_EQ(setgid(65534), 0)
> + TH_LOG("Failed to setgid(65534)");
> + ASSERT_EQ(setuid(65534), 0);
> +
> set_tid[0] = pid;
> /* This would fail without CAP_CHECKPOINT_RESTORE */
> - ASSERT_EQ(test_clone3_set_tid(set_tid, 1), -EPERM);
> + ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), -EPERM);
> ASSERT_EQ(set_capability(), 0)
> TH_LOG("Could not set CAP_CHECKPOINT_RESTORE");
> /* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> - ASSERT_EQ(test_clone3_set_tid(set_tid, 1), 0);
> + ASSERT_EQ(test_clone3_set_tid(_metadata, set_tid, 1), 0);
> }
>

Thanks for the changes. That looks much better.

Can you fix the test directly or do you need a new reworked patch from us?

Adrian

2020-07-20 13:01:34

by Christian Brauner

Subject: Re: [PATCH v6 0/7] capabilities: Introduce CAP_CHECKPOINT_RESTORE

On Mon, Jul 20, 2020 at 02:46:37PM +0200, Adrian Reber wrote:
> Thanks for the changes. That looks much better.
>
> Can you fix the test directly or do you need a new reworked patch from us?

No need for a new patch. I squashed this into your commit, added a comment
about my changes, signed it off, and am going to push it out for some more
testing.

Thanks!
Christian