2019-11-14 07:10:04

by Adrian Reber

[permalink] [raw]
Subject: [PATCH v9 1/2] fork: extend clone3() to support setting a PID

The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

PID NS level Requested PID
0 (host) 31496
1 42
2 1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

set_tid[0] = 1;
set_tid[1] = 42;
set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <[email protected]>
---
v2:
- Removed (size < sizeof(struct clone_args)) as discussed with
Christian and Dmitry
- Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
- Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)

v3:
- Return EEXIST if PID is already in use (Christian)
- Drop CLONE_SET_TID (Christian and Oleg)
- Use idr_is_empty() instead of idr_get_cursor() (Oleg)
- Handle different `struct clone_args` sizes (Dmitry)

v4:
- Rework struct size check with defines (Christian)
- Reduce number of set_tid checks (Oleg)
- Less parentheses and more robust code (Oleg)
- Do ns_capable() on correct user_ns (Oleg, Christian)

v5:
- make set_tid checks earlier in alloc_pid() (Christian)
- remove unnecessary comment and struct size check (Christian)

v6:
- remove CLONE_SET_TID from description (Christian)
- add clone3() tests for different clone_args sizes (Christian)
- move more set_tid checks to alloc_pid() (Oleg)
- make all set_tid checks lockless (Oleg)

v7:
- changed set_tid to be an array to set the PID of a process
in multiple nested PID namespaces at the same time as discussed
at LPC 2019 (container MC)

v8:
- skip unnecessary memset() (Rasmus)
- replace set_tid copy loop with a single copy (Christian)
- more parameter error checking (Christian)
- cache set_tid in alloc_pid() (Oleg)
- move code in "else" branch (Oleg)

v9:
- added kernel-doc to include/uapi/linux/sched.h (Christian)
- moved a variable to limit its scope; keep all set_tid_size
related changes in one place (Oleg)
---
include/linux/pid.h | 3 +-
include/linux/pid_namespace.h | 2 +
include/linux/sched/task.h | 3 ++
include/uapi/linux/sched.h | 52 +++++++++++++++++---------
kernel/fork.c | 24 +++++++++++-
kernel/pid.c | 69 +++++++++++++++++++++++++++--------
kernel/pid_namespace.c | 2 -
7 files changed, 118 insertions(+), 37 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 034e3cd60dc0..998ae7d24450 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -124,7 +124,8 @@ extern struct pid *find_vpid(int nr);
extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);

-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+ size_t set_tid_size);
extern void free_pid(struct pid *pid);
extern void disable_pid_allocation(struct pid_namespace *ns);

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..2ed6af88794b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -12,6 +12,8 @@
#include <linux/ns_common.h>
#include <linux/idr.h>

+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32

struct fs_pin;

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4b1c3b664f51..f1879884238e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,9 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+ pid_t *set_tid;
+ /* Number of elements in *set_tid */
+ size_t set_tid_size;
};

/*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 1d500ed03c63..9a8e222eafcd 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -39,24 +39,38 @@
#ifndef __ASSEMBLY__
/**
* struct clone_args - arguments for the clone3 syscall
- * @flags: Flags for the new process as listed above.
- * All flags are valid except for CSIGNAL and
- * CLONE_DETACHED.
- * @pidfd: If CLONE_PIDFD is set, a pidfd will be
- * returned in this argument.
- * @child_tid: If CLONE_CHILD_SETTID is set, the TID of the
- * child process will be returned in the child's
- * memory.
- * @parent_tid: If CLONE_PARENT_SETTID is set, the TID of
- * the child process will be returned in the
- * parent's memory.
- * @exit_signal: The exit_signal the parent process will be
- * sent when the child exits.
- * @stack: Specify the location of the stack for the
- * child process.
- * @stack_size: The size of the stack for the child process.
- * @tls: If CLONE_SETTLS is set, the tls descriptor
- * is set to tls.
+ * @flags: Flags for the new process as listed above.
+ * All flags are valid except for CSIGNAL and
+ * CLONE_DETACHED.
+ * @pidfd: If CLONE_PIDFD is set, a pidfd will be
+ * returned in this argument.
+ * @child_tid: If CLONE_CHILD_SETTID is set, the TID of the
+ * child process will be returned in the child's
+ * memory.
+ * @parent_tid: If CLONE_PARENT_SETTID is set, the TID of
+ * the child process will be returned in the
+ * parent's memory.
+ * @exit_signal: The exit_signal the parent process will be
+ * sent when the child exits.
+ * @stack: Specify the location of the stack for the
+ * child process.
+ * @stack_size: The size of the stack for the child process.
+ * @tls: If CLONE_SETTLS is set, the tls descriptor
+ * is set to tls.
+ * @set_tid: Pointer to an array of type *pid_t. The size
+ * of the array is defined using @set_tid_size.
+ * This array is used select PIDs/TIDs for newly
+ * created processes. The first element in this
+ * defines the PID in the most nested PID
+ * namespace. Each additional element in the array
+ * defines the PID in the parent PID namespace of
+ * the original PID namespace. If the array has
+ * less entries than the number of currently
+ * nested PID namespaces only the PIDs in the
+ * corresponding namespaces are set.
+ * @set_tid_size: This defines the size of the array referenced
+ * in @set_tid. This cannot be larger than the
+ * kernel's limit of nested PID namespaces.
*
* The structure is versioned by size and thus extensible.
* New struct members must go at the end of the struct and
@@ -71,6 +85,8 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+ __aligned_u64 set_tid;
+ __aligned_u64 set_tid_size;
};
#endif

diff --git a/kernel/fork.c b/kernel/fork.c
index 954e875e72b1..417570263f1f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2087,7 +2087,8 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);

if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+ pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
+ args->set_tid_size);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2590,6 +2591,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
{
int err;
struct clone_args args;
+ pid_t *kset_tid = kargs->set_tid;

if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
@@ -2600,6 +2602,15 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
if (err)
return err;

+ if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
+ return -EINVAL;
+
+ if (unlikely(!args.set_tid && args.set_tid_size > 0))
+ return -EINVAL;
+
+ if (unlikely(args.set_tid && args.set_tid_size == 0))
+ return -EINVAL;
+
/*
* Verify that higher 32bits of exit_signal are unset and that
* it is a valid signal
@@ -2617,8 +2628,16 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
.stack = args.stack,
.stack_size = args.stack_size,
.tls = args.tls,
+ .set_tid_size = args.set_tid_size,
};

+ if (args.set_tid &&
+ copy_from_user(kset_tid, u64_to_user_ptr(args.set_tid),
+ (kargs->set_tid_size * sizeof(pid_t))))
+ return -EFAULT;
+
+ kargs->set_tid = kset_tid;
+
return 0;
}

@@ -2662,6 +2681,9 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
int err;

struct kernel_clone_args kargs;
+ pid_t set_tid[MAX_PID_NS_LEVEL];
+
+ kargs.set_tid = set_tid;

err = copy_clone_args_from_user(&kargs, uargs, size);
if (err)
diff --git a/kernel/pid.c b/kernel/pid.c
index 7b5f6c963d72..eb32668997c6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,8 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+ size_t set_tid_size)
{
struct pid *pid;
enum pid_type type;
@@ -166,6 +167,17 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct upid *upid;
int retval = -ENOMEM;

+ /*
+ * set_tid_size contains the size of the set_tid array. Starting at
+ * the most nested currently active PID namespace it tells alloc_pid()
+ * which PID to set for a process in that most nested PID namespace
+ * up to set_tid_size PID namespaces. It does not have to set the PID
+ * for a process in all nested PID namespaces but set_tid_size must
+ * never be greater than the current ns->level + 1.
+ */
+ if (set_tid_size > ns->level + 1)
+ return ERR_PTR(-EINVAL);
+
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
return ERR_PTR(retval);
@@ -174,24 +186,51 @@ struct pid *alloc_pid(struct pid_namespace *ns)
pid->level = ns->level;

for (i = ns->level; i >= 0; i--) {
- int pid_min = 1;
+ int tid = 0;
+
+ if (set_tid_size) {
+ tid = set_tid[ns->level - i];
+ if (tid < 1 || tid >= pid_max)
+ return ERR_PTR(-EINVAL);
+ /*
+ * Also fail if a PID != 1 is requested and
+ * no PID 1 exists.
+ */
+ if (tid != 1 && !tmp->child_reaper)
+ return ERR_PTR(-EINVAL);
+ if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ set_tid_size--;
+ }

idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);

- /*
- * init really needs pid 1, but after reaching the maximum
- * wrap back to RESERVED_PIDS
- */
- if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
- pid_min = RESERVED_PIDS;
-
- /*
- * Store a null pointer so find_pid_ns does not find
- * a partially initialized PID (see below).
- */
- nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
+ if (tid) {
+ nr = idr_alloc(&tmp->idr, NULL, tid,
+ tid + 1, GFP_ATOMIC);
+ /*
+ * If ENOSPC is returned it means that the PID is
+ * alreay in use. Return EEXIST in that case.
+ */
+ if (nr == -ENOSPC)
+ nr = -EEXIST;
+ } else {
+ int pid_min = 1;
+ /*
+ * init really needs pid 1, but after reaching the
+ * maximum wrap back to RESERVED_PIDS
+ */
+ if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+ pid_min = RESERVED_PIDS;
+
+ /*
+ * Store a null pointer so find_pid_ns does not find
+ * a partially initialized PID (see below).
+ */
+ nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+ pid_max, GFP_ATOMIC);
+ }
spin_unlock_irq(&pidmap_lock);
idr_preload_end();

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a6a79f85c81a..d40017e79ebe 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -26,8 +26,6 @@

static DEFINE_MUTEX(pid_caches_mutex);
static struct kmem_cache *pid_ns_cachep;
-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
/* Write once array, filled from the beginning. */
static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL];


base-commit: 7acdfe534e726450fe8051abc2f36380fb2c2c0e
--
2.23.0


2019-11-14 07:10:49

by Adrian Reber

[permalink] [raw]
Subject: [PATCH v9 2/2] selftests: add tests for clone3() with *set_tid

This tests clone3() with *set_tid to see if all desired PIDs are working
as expected. The tests are trying multiple invalid input parameters as
well as creating processes while specifying a certain PID in multiple
PID namespaces at the same time.

Additionally this moves common clone3() test code into clone3_selftests.h.

Signed-off-by: Adrian Reber <[email protected]>
---
v9:
- applied all changes from Christian's review (except using the
NSpid: parsing code from selftests/pidfd/pidfd_fdinfo_test.c)
---
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 2 +-
tools/testing/selftests/clone3/clone3.c | 8 +-
.../selftests/clone3/clone3_clear_sighand.c | 20 +-
.../selftests/clone3/clone3_selftests.h | 29 ++
.../testing/selftests/clone3/clone3_set_tid.c | 355 ++++++++++++++++++
6 files changed, 389 insertions(+), 26 deletions(-)
create mode 100644 tools/testing/selftests/clone3/clone3_selftests.h
create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c

diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore
index 2a30ae18b06e..0dc4f32c6cb8 100644
--- a/tools/testing/selftests/clone3/.gitignore
+++ b/tools/testing/selftests/clone3/.gitignore
@@ -1,2 +1,3 @@
clone3
clone3_clear_sighand
+clone3_set_tid
diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index eb26eb793c80..cf976c732906 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
CFLAGS += -g -I../../../../usr/include/

-TEST_GEN_PROGS := clone3 clone3_clear_sighand
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid

include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 0f8a9ef40117..4669b3d418e7 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -18,6 +18,7 @@
#include <sched.h>

#include "../kselftest.h"
+#include "clone3_selftests.h"

/*
* Different sizes of struct clone_args
@@ -35,11 +36,6 @@ enum test_mode {
CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
};

-static pid_t raw_clone(struct clone_args *args, size_t size)
-{
- return syscall(__NR_clone3, args, size);
-}
-
static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
{
struct clone_args args = {
@@ -83,7 +79,7 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)

memcpy(&args_ext.args, &args, sizeof(struct clone_args));

- pid = raw_clone((struct clone_args *)&args_ext, size);
+ pid = sys_clone3((struct clone_args *)&args_ext, size);
if (pid < 0) {
ksft_print_msg("%s - Failed to create new process\n",
strerror(errno));
diff --git a/tools/testing/selftests/clone3/clone3_clear_sighand.c b/tools/testing/selftests/clone3/clone3_clear_sighand.c
index 0d957be1bdc5..456783ad19d6 100644
--- a/tools/testing/selftests/clone3/clone3_clear_sighand.c
+++ b/tools/testing/selftests/clone3/clone3_clear_sighand.c
@@ -14,30 +14,12 @@
#include <sys/wait.h>

#include "../kselftest.h"
+#include "clone3_selftests.h"

#ifndef CLONE_CLEAR_SIGHAND
#define CLONE_CLEAR_SIGHAND 0x100000000ULL
#endif

-#ifndef __NR_clone3
-#define __NR_clone3 -1
-struct clone_args {
- __aligned_u64 flags;
- __aligned_u64 pidfd;
- __aligned_u64 child_tid;
- __aligned_u64 parent_tid;
- __aligned_u64 exit_signal;
- __aligned_u64 stack;
- __aligned_u64 stack_size;
- __aligned_u64 tls;
-};
-#endif
-
-static pid_t sys_clone3(struct clone_args *args, size_t size)
-{
- return syscall(__NR_clone3, args, size);
-}
-
static void test_clone3_supported(void)
{
pid_t pid;
diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
new file mode 100644
index 000000000000..793e719d781d
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _CLONE3_SELFTESTS_H
+#define _CLONE3_SELFTESTS_H
+
+#include <stdint.h>
+
+#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
+
+#ifndef __NR_clone3
+#define __NR_clone3 -1
+struct clone_args {
+ __aligned_u64 flags;
+ __aligned_u64 pidfd;
+ __aligned_u64 child_tid;
+ __aligned_u64 parent_tid;
+ __aligned_u64 exit_signal;
+ __aligned_u64 stack;
+ __aligned_u64 stack_size;
+ __aligned_u64 tls;
+};
+#endif
+
+static pid_t sys_clone3(struct clone_args *args, size_t size)
+{
+ return syscall(__NR_clone3, args, size);
+}
+
+#endif /* _CLONE3_SELFTESTS_H */
diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c
new file mode 100644
index 000000000000..e73d01aea5db
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_set_tid.c
@@ -0,0 +1,355 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static int pipe_1[2];
+static int pipe_2[2];
+
+static int call_clone3_set_tid(pid_t *set_tid,
+ size_t set_tid_size,
+ int flags,
+ int expected_pid,
+ bool wait_for_it)
+{
+ int status;
+ int ret = 0;
+ pid_t pid = -1;
+
+ struct clone_args args = {
+ .flags = flags,
+ .exit_signal = SIGCHLD,
+ .set_tid = ptr_to_u64(set_tid),
+ .set_tid_size = set_tid_size,
+ };
+
+ pid = sys_clone3(&args, sizeof(struct clone_args));
+ if (pid < 0) {
+ ksft_print_msg("%s - Failed to create new process\n",
+ strerror(errno));
+ return -errno;
+ }
+
+ if (pid == 0) {
+ char tmp = 0;
+
+ ksft_print_msg("I am the child, my PID is %d (expected %d)\n",
+ getpid(), set_tid[0]);
+ if (wait_for_it) {
+ ksft_print_msg("[%d] Child is ready and waiting\n",
+ getpid());
+
+ /* Signal the parent that the child is ready */
+ close(pipe_1[0]);
+ write(pipe_1[1], &tmp, 1);
+ close(pipe_1[1]);
+ close(pipe_2[1]);
+ read(pipe_2[0], &tmp, 1);
+ close(pipe_2[0]);
+ }
+
+ if (set_tid[0] != getpid())
+ _exit(EXIT_FAILURE);
+ _exit(EXIT_SUCCESS);
+ }
+
+ if (expected_pid == 0 || expected_pid == pid) {
+ ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+ getpid(), pid);
+ } else {
+ ksft_print_msg(
+ "Expected child pid %d does not match actual pid %d\n",
+ expected_pid, pid);
+ ret = -1;
+ }
+
+ if (wait(&status) < 0) {
+ ksft_print_msg("Child returned %s\n", strerror(errno));
+ return -errno;
+ }
+
+ if (!WIFEXITED(status))
+ return -1;
+
+ return WEXITSTATUS(status);
+}
+
+static void test_clone3_set_tid(pid_t *set_tid,
+ size_t set_tid_size,
+ int flags,
+ int expected,
+ int expected_pid,
+ bool wait_for_it)
+{
+ int ret;
+
+ ksft_print_msg(
+ "[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n",
+ getpid(), set_tid[0], flags);
+ ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid,
+ wait_for_it);
+ ksft_print_msg(
+ "[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+ getpid(), set_tid[0], ret, expected);
+ if (ret != expected)
+ ksft_test_result_fail(
+ "[%d] Result (%d) is different than expected (%d)\n",
+ getpid(), ret, expected);
+ else
+ ksft_test_result_pass(
+ "[%d] Result (%d) matches expectation (%d)\n",
+ getpid(), ret, expected);
+}
+int main(int argc, char *argv[])
+{
+ FILE *f;
+ char buf;
+ char *line;
+ int status;
+ int ret = -1;
+ size_t len = 0;
+ int pid_max = 0;
+ uid_t uid = getuid();
+ char proc_path[100] = {0};
+ pid_t pid, ns1, ns2, ns3, ns_pid;
+ pid_t set_tid[MAX_PID_NS_LEVEL * 2];
+
+ if (pipe(pipe_1) < 0 || pipe(pipe_2) < 0)
+ ksft_exit_fail_msg("pipe() failed\n");
+
+ ksft_print_header();
+ ksft_set_plan(26);
+
+ f = fopen("/proc/sys/kernel/pid_max", "r");
+ if (f == NULL)
+ ksft_exit_fail_msg(
+ "%s - Could not open /proc/sys/kernel/pid_max\n",
+ strerror(errno));
+ fscanf(f, "%d", &pid_max);
+ fclose(f);
+ ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
+
+ /* Try invalid settings */
+ memset(&set_tid, 0, sizeof(set_tid));
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
+
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
+
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
+ -EINVAL, 0, 0);
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
+
+ /*
+ * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
+ * nested PID namespace.
+ */
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
+
+ memset(&set_tid, 0xff, sizeof(set_tid));
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
+
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
+
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
+ -EINVAL, 0, 0);
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
+ /*
+ * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
+ * nested PID namespace.
+ */
+ test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
+
+ memset(&set_tid, 0, sizeof(set_tid));
+ /* Try with an invalid PID */
+ set_tid[0] = 0;
+ test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+ set_tid[0] = -1;
+ test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+ /* Claim that the set_tid array actually contains 2 elements. */
+ test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0);
+ /* Try it in a new PID namespace */
+ if (uid == 0)
+ test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+ else
+ ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+ /*
+ * Try with a valid PID (1) but as non-root. This should fail
+ * with -EPERM if running in the initial user namespace.
+ * As root it should tell us -EEXIST.
+ */
+ set_tid[0] = 1;
+ if (uid == 0)
+ test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0);
+ else
+ ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+ /* Try it in a new PID namespace */
+ if (uid == 0)
+ test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0);
+ else
+ ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+ /* pid_max should fail everywhere */
+ set_tid[0] = pid_max;
+ test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+ if (uid == 0)
+ test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+ else
+ ksft_test_result_skip("Clone3() with set_tid requires root\n");
+
+ if (uid != 0) {
+ /*
+ * All remaining tests require root. Tell the framework
+ * that all those tests are skipped as non-root.
+ */
+ ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num();
+ goto out;
+ }
+
+ /* Find the current active PID */
+ pid = fork();
+ if (pid == 0) {
+ ksft_print_msg("Child has PID %d\n", getpid());
+ _exit(EXIT_SUCCESS);
+ }
+ (void)wait(NULL);
+
+ /* After the child has finished, its PID should be free. */
+ set_tid[0] = pid;
+ test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0);
+
+ /* This should fail as there is no PID 1 in that namespace */
+ test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
+
+ /*
+ * Creating a process with PID 1 in the newly created most nested
+ * PID namespace and PID 'pid' in the parent PID namespace. This
+ * needs to work.
+ */
+ set_tid[0] = 1;
+ set_tid[1] = pid;
+ test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0);
+
+ ksft_print_msg("unshare PID namespace\n");
+ if (unshare(CLONE_NEWPID) == -1)
+ ksft_exit_fail_msg("unshare(CLONE_NEWPID) failed: %s\n",
+ strerror(errno));
+
+ set_tid[0] = pid;
+
+ /* This should fail as there is no PID 1 in that namespace */
+ test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
+
+ /* Let's create a PID 1 */
+ ns_pid = fork();
+ if (ns_pid == 0) {
+ ksft_print_msg("Child in PID namespace has PID %d\n", getpid());
+ set_tid[0] = 2;
+ test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0);
+ set_tid[0] = 1;
+ set_tid[1] = 42;
+ set_tid[2] = pid;
+ /*
+ * This should fail as there are not enough active PID
+ * namespaces. Again assuming this is running in the host's
+ * PID namespace. Not yet nested.
+ */
+ test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0);
+
+ /*
+ * This should work and from the parent we should see
+ * something like 'NSpid: pid 42 1'.
+ */
+ test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, 1);
+
+ _exit(ksft_cnt.ksft_pass);
+ }
+
+ close(pipe_1[1]);
+ close(pipe_2[0]);
+ while (read(pipe_1[0], &buf, 1) > 0) {
+ ksft_print_msg("[%d] Child is ready and waiting\n", getpid());
+ break;
+ }
+
+ snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid);
+ f = fopen(proc_path, "r");
+ if (f == NULL)
+ ksft_exit_fail_msg(
+ "%s - Could not open %s\n",
+ strerror(errno), proc_path);
+
+ while (getline(&line, &len, f) != -1) {
+ if (strstr(line, "NSpid")) {
+ int i;
+
+ /* Verify that all generated PIDs are as expected. */
+ i = sscanf(line, "NSpid:\t%d\t%d\t%d",
+ &ns3, &ns2, &ns1);
+ if (i != 3)
+ ns1 = ns2 = ns3 = 0;
+ break;
+ }
+ }
+ fclose(f);
+ free(line);
+ close(pipe_2[0]);
+
+ /* Tell the clone3()'d child to finish. */
+ write(pipe_2[1], &buf, 1);
+ close(pipe_2[1]);
+
+ if (wait(&status) < 0) {
+ ksft_print_msg("Child returned %s\n", strerror(errno));
+ ret = -errno;
+ goto out;
+ }
+
+ if (!WIFEXITED(status))
+ ksft_test_result_fail("Child error\n");
+
+ if (WEXITSTATUS(status))
+ /*
+ * Update the number of total tests with the tests from the
+ * child processes.
+ */
+ ksft_cnt.ksft_pass = WEXITSTATUS(status);
+
+ if (ns3 == pid && ns2 == 42 && ns1 == 1)
+ ksft_test_result_pass(
+ "PIDs in all namespaces as expected (%d,%d,%d)\n",
+ ns3, ns2, ns1);
+ else
+ ksft_test_result_fail(
+ "PIDs in all namespaces not as expected (%d,%d,%d)\n",
+ ns3, ns2, ns1);
+out:
+ ret = 0;
+
+ return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
--
2.23.0

2019-11-14 08:00:09

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v9 1/2] fork: extend clone3() to support setting a PID

On Thu, Nov 14, 2019 at 08:07:08AM +0100, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
>
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
>
> Extending clone3() to support *set_tid makes it possible restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
>
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
>
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
>
> To create a process with the following PIDs:
>
> PID NS level Requested PID
> 0 (host) 31496
> 1 42
> 2 1
>
> For that example the two newly introduced parameters to struct
> clone_args (set_tid and set_tid_size) would need to be:
>
> set_tid[0] = 1;
> set_tid[1] = 42;
> set_tid[2] = 31496;
> set_tid_size = 3;
>
> If only the PIDs of the two innermost nested PID namespaces should be
> defined it would look like this:
>
> set_tid[0] = 1;
> set_tid[1] = 42;
> set_tid_size = 2;
>
> The PID of the newly created process would then be the next available
> free PID in the PID namespace level 0 (host) and 42 in the PID namespace
> at level 1 and the PID of the process in the innermost PID namespace
> would be 1.
>
> The set_tid array is used to specify the PID of a process starting
> from the innermost nested PID namespaces up to set_tid_size PID namespaces.
>
> set_tid_size cannot be larger then the current PID namespace level.
>
> Signed-off-by: Adrian Reber <[email protected]>

I have no quarrels with the core patch anymore.
Note, once Oleg has said he's fine with this patch too I will likely
reword the kernel-doc and the comment in alloc_pid() and the commit
message a little before applying; but really just minor things that are
not worth resending for.

Thanks!
Reviewed-by: Christian Brauner <[email protected]>

2019-11-14 08:04:36

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v9 1/2] fork: extend clone3() to support setting a PID

On Thu, Nov 14, 2019 at 08:57:59AM +0100, Christian Brauner wrote:
> On Thu, Nov 14, 2019 at 08:07:08AM +0100, Adrian Reber wrote:
> > The main motivation to add set_tid to clone3() is CRIU.
> >
> > To restore a process with the same PID/TID CRIU currently uses
> > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > ns_last_pid and then (quickly) does a clone(). This works most of the
> > time, but it is racy. It is also slow as it requires multiple syscalls.
> >
> > Extending clone3() to support *set_tid makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> >
> > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > on clone3() with *set_tid as they are currently in place for ns_last_pid.
> >
> > The original version of this change was using a single value for
> > set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> > decided to change set_tid to an array to enable setting the PID of a
> > process in multiple PID namespaces at the same time. If a process is
> > created in a PID namespace it is possible to influence the PID inside
> > and outside of the PID namespace. Details also in the corresponding
> > selftest.
> >
> > To create a process with the following PIDs:
> >
> > PID NS level Requested PID
> > 0 (host) 31496
> > 1 42
> > 2 1
> >
> > For that example the two newly introduced parameters to struct
> > clone_args (set_tid and set_tid_size) would need to be:
> >
> > set_tid[0] = 1;
> > set_tid[1] = 42;
> > set_tid[2] = 31496;
> > set_tid_size = 3;
> >
> > If only the PIDs of the two innermost nested PID namespaces should be
> > defined it would look like this:
> >
> > set_tid[0] = 1;
> > set_tid[1] = 42;
> > set_tid_size = 2;
> >
> > The PID of the newly created process would then be the next available
> > free PID in the PID namespace level 0 (host) and 42 in the PID namespace
> > at level 1 and the PID of the process in the innermost PID namespace
> > would be 1.
> >
> > The set_tid array is used to specify the PID of a process starting
> > from the innermost nested PID namespaces up to set_tid_size PID namespaces.
> >
> > set_tid_size cannot be larger then the current PID namespace level.
> >
> > Signed-off-by: Adrian Reber <[email protected]>
>
> I have no quarrels with the core patch anymore.
> Note, once Oleg has said he's fine with this patch too I will likely
> reword the kernel-doc and the comment in alloc_pid() and the commit
> message a little before applying; but really just minor things that are
> not worth resending for.
>
> Thanks!
> Reviewed-by: Christian Brauner <[email protected]>

And thanks for the --base=<commit> information you've added to this
version. :)

Christian

2019-11-14 08:15:37

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v9 1/2] fork: extend clone3() to support setting a PID

On Thu, Nov 14, 2019 at 08:07:08AM +0100, Adrian Reber wrote:
> * The structure is versioned by size and thus extensible.
> * New struct members must go at the end of the struct and
> @@ -71,6 +85,8 @@ struct clone_args {
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> + __aligned_u64 set_tid;
> + __aligned_u64 set_tid_size;

Oh, one thing that is missing is the addition of

+#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */

in sched.h. Please add that, when you resend.

Thanks!
Christian

2019-11-14 08:25:54

by Adrian Reber

[permalink] [raw]
Subject: Re: [PATCH v9 1/2] fork: extend clone3() to support setting a PID

On Thu, Nov 14, 2019 at 09:14:00AM +0100, Christian Brauner wrote:
> On Thu, Nov 14, 2019 at 08:07:08AM +0100, Adrian Reber wrote:
> > * The structure is versioned by size and thus extensible.
> > * New struct members must go at the end of the struct and
> > @@ -71,6 +85,8 @@ struct clone_args {
> > __aligned_u64 stack;
> > __aligned_u64 stack_size;
> > __aligned_u64 tls;
> > + __aligned_u64 set_tid;
> > + __aligned_u64 set_tid_size;
>
> Oh, one thing that is missing is the addition of
>
> +#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
>
> in sched.h. Please add that, when you resend.

Even if it is unused?

Adrian

2019-11-14 08:33:50

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v9 1/2] fork: extend clone3() to support setting a PID

On Thu, Nov 14, 2019 at 09:22:09AM +0100, Adrian Reber wrote:
> On Thu, Nov 14, 2019 at 09:14:00AM +0100, Christian Brauner wrote:
> > On Thu, Nov 14, 2019 at 08:07:08AM +0100, Adrian Reber wrote:
> > > * The structure is versioned by size and thus extensible.
> > > * New struct members must go at the end of the struct and
> > > @@ -71,6 +85,8 @@ struct clone_args {
> > > __aligned_u64 stack;
> > > __aligned_u64 stack_size;
> > > __aligned_u64 tls;
> > > + __aligned_u64 set_tid;
> > > + __aligned_u64 set_tid_size;
> >
> > Oh, one thing that is missing is the addition of
> >
> > +#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
> >
> > in sched.h. Please add that, when you resend.
>
> Even if it is unused?

Yes, I think so. It's nice for userspace to have these macros for
versioning purposes. We already started exposing this versioning with
_VER0 to userspace.

Christian

2019-11-14 08:40:57

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v9 2/2] selftests: add tests for clone3() with *set_tid

On Thu, Nov 14, 2019 at 08:07:09AM +0100, Adrian Reber wrote:
> +++ b/tools/testing/selftests/clone3/clone3_selftests.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _CLONE3_SELFTESTS_H
> +#define _CLONE3_SELFTESTS_H
> +
> +#include <stdint.h>
> +
> +#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
> +
> +#ifndef __NR_clone3
> +#define __NR_clone3 -1
> +struct clone_args {
> + __aligned_u64 flags;

Hugh, doesn't that complain about missing definition of __aligned_u64?
I thought that was declared in linux/types.h

> + __aligned_u64 pidfd;
> + __aligned_u64 child_tid;
> + __aligned_u64 parent_tid;
> + __aligned_u64 exit_signal;
> + __aligned_u64 stack;
> + __aligned_u64 stack_size;
> + __aligned_u64 tls;
> +};
> +#endif
> +
> +static pid_t sys_clone3(struct clone_args *args, size_t size)
> +{
> + return syscall(__NR_clone3, args, size);

And doesn't that one complain about missing syscall() defined in
sys/syscall.h?

I would rather make the header sm like:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <syscall.h>
#include <linux/types.h>

so that the order of inclusion doesn't matter. You're getting away with
this because you include syscall.h, etc. before clone3_selftests.h in
clone3_set_tid.c below. I'd rather not rely on such ordering.

/* stuff */

> +}
> +
> +#endif /* _CLONE3_SELFTESTS_H */
> diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c
> + if (expected_pid == 0 || expected_pid == pid) {
> + ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> + getpid(), pid);
> + } else {
> + ksft_print_msg(
> + "Expected child pid %d does not match actual pid %d\n",
> + expected_pid, pid);
> + ret = -1;

What's that for?
I think that whole ret variable can go, no?

> + }
> +
> + if (wait(&status) < 0) {
> + ksft_print_msg("Child returned %s\n", strerror(errno));
> + return -errno;
> + }
> +
> + if (!WIFEXITED(status))
> + return -1;
> +
> + return WEXITSTATUS(status);
> +}
> +
> +static void test_clone3_set_tid(pid_t *set_tid,
> + size_t set_tid_size,
> + int flags,
> + int expected,
> + int expected_pid,
> + bool wait_for_it)
> +{
> + int ret;
> +
> + ksft_print_msg(
> + "[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n",
> + getpid(), set_tid[0], flags);
> + ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid,
> + wait_for_it);
> + ksft_print_msg(
> + "[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> + getpid(), set_tid[0], ret, expected);
> + if (ret != expected)
> + ksft_test_result_fail(
> + "[%d] Result (%d) is different than expected (%d)\n",
> + getpid(), ret, expected);
> + else
> + ksft_test_result_pass(
> + "[%d] Result (%d) matches expectation (%d)\n",
> + getpid(), ret, expected);
> +}
> +int main(int argc, char *argv[])
> +{
> + FILE *f;
> + char buf;
> + char *line;
> + int status;
> + int ret = -1;
> + size_t len = 0;
> + int pid_max = 0;
> + uid_t uid = getuid();
> + char proc_path[100] = {0};
> + pid_t pid, ns1, ns2, ns3, ns_pid;
> + pid_t set_tid[MAX_PID_NS_LEVEL * 2];

reverse-reverse Christmas tree ordering I see :)

> +
> + if (pipe(pipe_1) < 0 || pipe(pipe_2) < 0)
> + ksft_exit_fail_msg("pipe() failed\n");
> +
> + ksft_print_header();
> + ksft_set_plan(26);
> +
> + f = fopen("/proc/sys/kernel/pid_max", "r");
> + if (f == NULL)
> + ksft_exit_fail_msg(
> + "%s - Could not open /proc/sys/kernel/pid_max\n",
> + strerror(errno));
> + fscanf(f, "%d", &pid_max);
> + fclose(f);
> + ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max);
> +
> + /* Try invalid settings */
> + memset(&set_tid, 0, sizeof(set_tid));
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
> +
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
> +
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
> + -EINVAL, 0, 0);

\n

> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);
> +
> + /*
> + * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
> + * nested PID namespace.
> + */
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
> +
> + memset(&set_tid, 0xff, sizeof(set_tid));
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -EINVAL, 0, 0);
> +
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -EINVAL, 0, 0);
> +
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0,
> + -EINVAL, 0, 0);

\n

> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -EINVAL, 0, 0);

\n

> + /*
> + * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1
> + * nested PID namespace.
> + */
> + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0);
> +
> + memset(&set_tid, 0, sizeof(set_tid));
> + /* Try with an invalid PID */
> + set_tid[0] = 0;
> + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);

\n

> + set_tid[0] = -1;
> + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);

\n

> + /* Claim that the set_tid array actually contains 2 elements. */
> + test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0);

\n

> + /* Try it in a new PID namespace */
> + if (uid == 0)
> + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> + else
> + ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> + /*
> + * Try with a valid PID (1) but as non-root. This should fail

Where are you trying this as non-root? I think that comment is obsolete,
no? :)

> + * with -EPERM if running in the initial user namespace.
> + * As root it should tell us -EEXIST.
> + */
> + set_tid[0] = 1;
> + if (uid == 0)
> + test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0);
> + else
> + ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> + /* Try it in a new PID namespace */
> + if (uid == 0)
> + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0);
> + else
> + ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> + /* pid_max should fail everywhere */
> + set_tid[0] = pid_max;
> + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);

\n

> + if (uid == 0)
> + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> + else
> + ksft_test_result_skip("Clone3() with set_tid requires root\n");
> +
> + if (uid != 0) {
> + /*
> + * All remaining tests require root. Tell the framework
> + * that all those tests are skipped as non-root.
> + */
> + ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num();
> + goto out;
> + }
> +
> + /* Find the current active PID */
> + pid = fork();
> + if (pid == 0) {
> + ksft_print_msg("Child has PID %d\n", getpid());
> + _exit(EXIT_SUCCESS);
> + }
> + (void)wait(NULL);
> +
> + /* After the child has finished, its PID should be free. */
> + set_tid[0] = pid;
> + test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0);
> +
> + /* This should fail as there is no PID 1 in that namespace */
> + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0);
> +
> + /*
> + * Creating a process with PID 1 in the newly created most nested
> + * PID namespace and PID 'pid' in the parent PID namespace. This
> + * needs to work.
> + */
> + set_tid[0] = 1;
> + set_tid[1] = pid;
> + test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0);
> +
> + ksft_print_msg("unshare PID namespace\n");
> + if (unshare(CLONE_NEWPID) == -1)
> + ksft_exit_fail_msg("unshare(CLONE_NEWPID) failed: %s\n",
> + strerror(errno));
> +
> + set_tid[0] = pid;
> +
> + /* This should fail as there is no PID 1 in that namespace */
> + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0);
> +
> + /* Let's create a PID 1 */
> + ns_pid = fork();
> + if (ns_pid == 0) {
> + ksft_print_msg("Child in PID namespace has PID %d\n", getpid());
> + set_tid[0] = 2;
> + test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0);

\n

> + set_tid[0] = 1;
> + set_tid[1] = 42;
> + set_tid[2] = pid;
> + /*
> + * This should fail as there are not enough active PID
> + * namespaces. Again assuming this is running in the host's
> + * PID namespace. Not yet nested.
> + */
> + test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0);
> +
> + /*
> + * This should work and from the parent we should see
> + * something like 'NSpid: pid 42 1'.
> + */
> + test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, 1);

Nit: test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, true);
^^^ not 1 :)