The main motivation to add set_tid to clone3() is CRIU.
To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.
Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).
This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.
The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.
To create a process with the following PIDs:
PID NS level Requested PID
0 (host) 31496
1 42
2 1
For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:
set_tid[0] = 1;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;
If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:
set_tid[0] = 1;
set_tid[1] = 42;
set_tid_size = 2;
The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.
The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.
set_tid_size cannot be larger then the current PID namespace level.
Signed-off-by: Adrian Reber <[email protected]>
---
v2:
- Removed (size < sizeof(struct clone_args)) as discussed with
Christian and Dmitry
- Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
- Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
v3:
- Return EEXIST if PID is already in use (Christian)
- Drop CLONE_SET_TID (Christian and Oleg)
- Use idr_is_empty() instead of idr_get_cursor() (Oleg)
- Handle different `struct clone_args` sizes (Dmitry)
v4:
- Rework struct size check with defines (Christian)
- Reduce number of set_tid checks (Oleg)
- Less parentheses and more robust code (Oleg)
- Do ns_capable() on correct user_ns (Oleg, Christian)
v5:
- make set_tid checks earlier in alloc_pid() (Christian)
- remove unnecessary comment and struct size check (Christian)
v6:
- remove CLONE_SET_TID from description (Christian)
- add clone3() tests for different clone_args sizes (Christian)
- move more set_tid checks to alloc_pid() (Oleg)
- make all set_tid checks lockless (Oleg)
v7:
- changed set_tid to be an array to set the PID of a process
in multiple nested PID namespaces at the same time as discussed
at LPC 2019 (container MC)
---
include/linux/pid.h | 3 ++-
include/linux/pid_namespace.h | 2 ++
include/linux/sched/task.h | 3 +++
include/uapi/linux/sched.h | 2 ++
kernel/fork.c | 20 +++++++++++++-
kernel/pid.c | 50 ++++++++++++++++++++++++++++++-----
kernel/pid_namespace.c | 2 --
7 files changed, 71 insertions(+), 11 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 9645b1194c98..034b7df25888 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -120,7 +120,8 @@ extern struct pid *find_vpid(int nr);
extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+ size_t set_tid_size);
extern void free_pid(struct pid *pid);
extern void disable_pid_allocation(struct pid_namespace *ns);
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..2ed6af88794b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -12,6 +12,8 @@
#include <linux/ns_common.h>
#include <linux/idr.h>
+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32
struct fs_pin;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4b1c3b664f51..f1879884238e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,9 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+ pid_t *set_tid;
+ /* Number of elements in *set_tid */
+ size_t set_tid_size;
};
/*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 25b4fa00bad1..13f74c40a629 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -72,6 +72,8 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+ __aligned_u64 set_tid;
+ __aligned_u64 set_tid_size;
};
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 55af6931c6ec..c70b9224997d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2026,7 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+ pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
+ args->set_tid_size);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2527,6 +2528,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
struct clone_args __user *uargs,
size_t usize)
{
+ int i;
int err;
struct clone_args args;
@@ -2539,6 +2541,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
if (err)
return err;
+ if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
+ return -E2BIG;
+
/*
* Verify that higher 32bits of exit_signal are unset and that
* it is a valid signal
@@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
.stack = args.stack,
.stack_size = args.stack_size,
.tls = args.tls,
+ .set_tid = kargs->set_tid,
+ .set_tid_size = args.set_tid_size,
};
+ for (i = 0; i < args.set_tid_size; i++) {
+ if (copy_from_user(&kargs->set_tid[i],
+ u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
+ sizeof(pid_t)))
+ return -EFAULT;
+ }
+
return 0;
}
@@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
int err;
struct kernel_clone_args kargs;
+ pid_t set_tid[MAX_PID_NS_LEVEL];
+
+ memset(set_tid, 0, sizeof(set_tid));
+ kargs.set_tid = set_tid;
err = copy_clone_args_from_user(&kargs, uargs, size);
if (err)
diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..82b3f91c131c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,8 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+ size_t set_tid_size)
{
struct pid *pid;
enum pid_type type;
@@ -166,6 +167,17 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct upid *upid;
int retval = -ENOMEM;
+ /*
+ * set_tid_size contains the size of the set_tid array. Starting at
+ * the most nested currently active PID namespace it tells alloc_pid()
+ * which PID to set for a process in that most nested PID namespace
+ * up to set_tid_size PID namespaces. It does not have to set the PID
+ * for a process in all nested PID namespaces but set_tid_size must
+ * never be greater than the current ns->level + 1.
+ */
+ if (set_tid_size > ns->level + 1)
+ return ERR_PTR(-EINVAL);
+
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
return ERR_PTR(retval);
@@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (i = ns->level; i >= 0; i--) {
int pid_min = 1;
+ int t_pos = 0;
+
+ if (set_tid_size) {
+ t_pos = - (i - ns->level);
+ if (set_tid[t_pos] < 1 || set_tid[t_pos] >= pid_max)
+ return ERR_PTR(-EINVAL);
+ /* Also fail if a PID != 1 is requested and no PID 1 exists */
+ if (set_tid[t_pos] != 1 && !tmp->child_reaper)
+ return ERR_PTR(-EINVAL);
+ if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ }
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);
@@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
pid_min = RESERVED_PIDS;
- /*
- * Store a null pointer so find_pid_ns does not find
- * a partially initialized PID (see below).
- */
- nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
+ if (set_tid_size) {
+ nr = idr_alloc(&tmp->idr, NULL, set_tid[t_pos],
+ set_tid[t_pos] + 1, GFP_ATOMIC);
+ /*
+ * If ENOSPC is returned it means that the PID is
+ * alreay in use. Return EEXIST in that case.
+ */
+ if (nr == -ENOSPC)
+ nr = -EEXIST;
+ set_tid_size--;
+ } else {
+ /*
+ * Store a null pointer so find_pid_ns does not find
+ * a partially initialized PID (see below).
+ */
+ nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+ pid_max, GFP_ATOMIC);
+ }
spin_unlock_irq(&pidmap_lock);
idr_preload_end();
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a6a79f85c81a..d40017e79ebe 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -26,8 +26,6 @@
static DEFINE_MUTEX(pid_caches_mutex);
static struct kmem_cache *pid_ns_cachep;
-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
/* Write once array, filled from the beginning. */
static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL];
--
2.23.0
On 11/11, Adrian Reber wrote:
>
> v7:
> - changed set_tid to be an array to set the PID of a process
> in multiple nested PID namespaces at the same time as discussed
> at LPC 2019 (container MC)
cough... iirc you convinced me this is not needed when we discussed
the previous version ;) Nevermind, probably my memory fools me.
So far I only have some cosmetic nits,
> @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
> for (i = ns->level; i >= 0; i--) {
> int pid_min = 1;
> + int t_pos = 0;
^^^^^
I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
the code a bit more simple.
> @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> pid_min = RESERVED_PIDS;
You can probably move this code into the "else" branch below.
IOW, something like
for (i = ns->level; i >= 0; i--) {
int xxx = 0;
if (set_tid_size) {
int pos = ns->level - i;
xxx = set_tid[pos];
if (xxx < 1 || xxx >= pid_max)
return ERR_PTR(-EINVAL);
/* Also fail if a PID != 1 is requested and no PID 1 exists */
if (xxx != 1 && !tmp->child_reaper)
return ERR_PTR(-EINVAL);
if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
set_tid_size--;
}
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);
if (xxx) {
nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
GFP_ATOMIC);
/*
* If ENOSPC is returned it means that the PID is
* alreay in use. Return EEXIST in that case.
*/
if (nr == -ENOSPC)
nr = -EEXIST;
} else {
int pid_min = 1;
/*
* init really needs pid 1, but after reaching the
* maximum wrap back to RESERVED_PIDS
*/
if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
pid_min = RESERVED_PIDS;
/*
* Store a null pointer so find_pid_ns does not find
* a partially initialized PID (see below).
*/
nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
pid_max, GFP_ATOMIC);
}
...
This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
note also that this way we can easily allow set_tid[some_level] == 0, we can
simply do
- if (xxx < 1 || xxx >= pid_max)
+ if (xxx < 0 || xxx >= pid_max)
although I don't think this is really useful.
Oleg.
On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
> On 11/11, Adrian Reber wrote:
> >
> > v7:
> > - changed set_tid to be an array to set the PID of a process
> > in multiple nested PID namespaces at the same time as discussed
> > at LPC 2019 (container MC)
>
> cough... iirc you convinced me this is not needed when we discussed
> the previous version ;) Nevermind, probably my memory fools me.
You are right. You suggested the same thing and we didn't listen ;)
> So far I only have some cosmetic nits,
Thanks for the quick review. I will try to apply your suggestions.
> > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >
> > for (i = ns->level; i >= 0; i--) {
> > int pid_min = 1;
> > + int t_pos = 0;
> ^^^^^
>
> I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
> the code a bit more simple.
>
> > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> > pid_min = RESERVED_PIDS;
>
> You can probably move this code into the "else" branch below.
>
> IOW, something like
>
>
> for (i = ns->level; i >= 0; i--) {
> int xxx = 0;
>
> if (set_tid_size) {
> int pos = ns->level - i;
>
> xxx = set_tid[pos];
> if (xxx < 1 || xxx >= pid_max)
> return ERR_PTR(-EINVAL);
> /* Also fail if a PID != 1 is requested and no PID 1 exists */
> if (xxx != 1 && !tmp->child_reaper)
> return ERR_PTR(-EINVAL);
> if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> return ERR_PTR(-EPERM);
> set_tid_size--;
> }
>
> idr_preload(GFP_KERNEL);
> spin_lock_irq(&pidmap_lock);
>
> if (xxx) {
> nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
> GFP_ATOMIC);
> /*
> * If ENOSPC is returned it means that the PID is
> * alreay in use. Return EEXIST in that case.
> */
> if (nr == -ENOSPC)
> nr = -EEXIST;
> } else {
> int pid_min = 1;
> /*
> * init really needs pid 1, but after reaching the
> * maximum wrap back to RESERVED_PIDS
> */
> if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> pid_min = RESERVED_PIDS;
> /*
> * Store a null pointer so find_pid_ns does not find
> * a partially initialized PID (see below).
> */
> nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> pid_max, GFP_ATOMIC);
> }
>
> ...
>
> This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
>
> note also that this way we can easily allow set_tid[some_level] == 0, we can
> simply do
>
> - if (xxx < 1 || xxx >= pid_max)
> + if (xxx < 0 || xxx >= pid_max)
>
> although I don't think this is really useful.
Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
be useful (or maybe even valid).
Adrian
On Mon, Nov 11, 2019 at 02:17:03PM +0100, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
>
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
>
> Extending clone3() to support *set_tid makes it possible restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
>
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
>
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
>
> To create a process with the following PIDs:
>
> PID NS level Requested PID
> 0 (host) 31496
> 1 42
> 2 1
>
> For that example the two newly introduced parameters to struct
> clone_args (set_tid and set_tid_size) would need to be:
>
> set_tid[0] = 1;
> set_tid[1] = 42;
> set_tid[2] = 31496;
> set_tid_size = 3;
>
> If only the PIDs of the two innermost nested PID namespaces should be
> defined it would look like this:
>
> set_tid[0] = 1;
> set_tid[1] = 42;
> set_tid_size = 2;
>
> The PID of the newly created process would then be the next available
> free PID in the PID namespace level 0 (host) and 42 in the PID namespace
> at level 1 and the PID of the process in the innermost PID namespace
> would be 1.
I'm not a fan of this array-based patch tbh, especially since afaict
CRIU has only vague plans to support restoring nested pid namespaces but
I won't stand in the way. :)
>
> The set_tid array is used to specify the PID of a process starting
> from the innermost nested PID namespaces up to set_tid_size PID namespaces.
>
> set_tid_size cannot be larger then the current PID namespace level.
>
> Signed-off-by: Adrian Reber <[email protected]>
> ---
> v2:
> - Removed (size < sizeof(struct clone_args)) as discussed with
> Christian and Dmitry
> - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
> - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
>
> v3:
> - Return EEXIST if PID is already in use (Christian)
> - Drop CLONE_SET_TID (Christian and Oleg)
> - Use idr_is_empty() instead of idr_get_cursor() (Oleg)
> - Handle different `struct clone_args` sizes (Dmitry)
>
> v4:
> - Rework struct size check with defines (Christian)
> - Reduce number of set_tid checks (Oleg)
> - Less parentheses and more robust code (Oleg)
> - Do ns_capable() on correct user_ns (Oleg, Christian)
>
> v5:
> - make set_tid checks earlier in alloc_pid() (Christian)
> - remove unnecessary comment and struct size check (Christian)
>
> v6:
> - remove CLONE_SET_TID from description (Christian)
> - add clone3() tests for different clone_args sizes (Christian)
> - move more set_tid checks to alloc_pid() (Oleg)
> - make all set_tid checks lockless (Oleg)
>
> v7:
> - changed set_tid to be an array to set the PID of a process
> in multiple nested PID namespaces at the same time as discussed
> at LPC 2019 (container MC)
> ---
> include/linux/pid.h | 3 ++-
> include/linux/pid_namespace.h | 2 ++
> include/linux/sched/task.h | 3 +++
> include/uapi/linux/sched.h | 2 ++
> kernel/fork.c | 20 +++++++++++++-
> kernel/pid.c | 50 ++++++++++++++++++++++++++++++-----
> kernel/pid_namespace.c | 2 --
> 7 files changed, 71 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 9645b1194c98..034b7df25888 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -120,7 +120,8 @@ extern struct pid *find_vpid(int nr);
> extern struct pid *find_get_pid(int nr);
> extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
>
> -extern struct pid *alloc_pid(struct pid_namespace *ns);
> +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> + size_t set_tid_size);
> extern void free_pid(struct pid *pid);
> extern void disable_pid_allocation(struct pid_namespace *ns);
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 49538b172483..2ed6af88794b 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -12,6 +12,8 @@
> #include <linux/ns_common.h>
> #include <linux/idr.h>
>
> +/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
> +#define MAX_PID_NS_LEVEL 32
>
> struct fs_pin;
>
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 4b1c3b664f51..f1879884238e 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -26,6 +26,9 @@ struct kernel_clone_args {
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> + pid_t *set_tid;
> + /* Number of elements in *set_tid */
> + size_t set_tid_size;
> };
>
> /*
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 25b4fa00bad1..13f74c40a629 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -72,6 +72,8 @@ struct clone_args {
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> + __aligned_u64 set_tid;
> + __aligned_u64 set_tid_size;
> };
> #endif
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 55af6931c6ec..c70b9224997d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2026,7 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
> stackleak_task_init(p);
>
> if (pid != &init_struct_pid) {
> - pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> + pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
> + args->set_tid_size);
> if (IS_ERR(pid)) {
> retval = PTR_ERR(pid);
> goto bad_fork_cleanup_thread;
> @@ -2527,6 +2528,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> struct clone_args __user *uargs,
> size_t usize)
> {
> + int i;
> int err;
> struct clone_args args;
>
> @@ -2539,6 +2541,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> if (err)
> return err;
>
> + if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
> + return -E2BIG;
I'd prefer EINVAL for this case. Mostly because we do that for other
arguments already. E2BIG currently only reflects invalid size of struct
clone_args itself and which makes it easy to spot for userspace and I'd
like to not overload that error too (The not-overloading-EINVAL-ship
has already sailed a long time ago.).
Also, it seems this misses some more checks, maybe?
if (!args.set_tid && args.set_tid_size > 0)
return -EINVAL;
if (args.set_tid && args.set_tid_size == 0)
return -EINVAL;
> +
> /*
> * Verify that higher 32bits of exit_signal are unset and that
> * it is a valid signal
> @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> .stack = args.stack,
> .stack_size = args.stack_size,
> .tls = args.tls,
> + .set_tid = kargs->set_tid,
Everything setup here uses members from "args" not "kargs".
That difference is very easy to overlook should anything change in this
code in the future. I'd prefer if setting up both (set_tid and
set_tid_size) or at least the set_tid pointer would be done later,
after the copy_from_user() stuff.
So at the top of this function you could do:
pid_t *kset_tid = kargs->set_tid;
> + .set_tid_size = args.set_tid_size,
> };
>
> + for (i = 0; i < args.set_tid_size; i++) {
> + if (copy_from_user(&kargs->set_tid[i],
> + u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> + sizeof(pid_t)))
> + return -EFAULT;
> + }
All of this indexing doesn't look very nice and introduces racyness.
Can't we do something like:
if (args->set_tid && copy_from_user(kset_tid, u64_to_user_ptr(args.set_tid), (kargs->set_tid_size * sizeof(pid_t))))
return -EFAULT;
kargs->set_tid = kset_tid;
Christian
On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
> On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
> > On 11/11, Adrian Reber wrote:
> > >
> > > v7:
> > > - changed set_tid to be an array to set the PID of a process
> > > in multiple nested PID namespaces at the same time as discussed
> > > at LPC 2019 (container MC)
> >
> > cough... iirc you convinced me this is not needed when we discussed
> > the previous version ;) Nevermind, probably my memory fools me.
>
> You are right. You suggested the same thing and we didn't listen ;)
>
> > So far I only have some cosmetic nits,
>
> Thanks for the quick review. I will try to apply your suggestions.
>
> > > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > >
> > > for (i = ns->level; i >= 0; i--) {
> > > int pid_min = 1;
> > > + int t_pos = 0;
> > ^^^^^
> >
> > I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
> > the code a bit more simple.
> >
> > > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> > > pid_min = RESERVED_PIDS;
> >
> > You can probably move this code into the "else" branch below.
> >
> > IOW, something like
> >
> >
> > for (i = ns->level; i >= 0; i--) {
> > int xxx = 0;
> >
> > if (set_tid_size) {
> > int pos = ns->level - i;
> >
> > xxx = set_tid[pos];
> > if (xxx < 1 || xxx >= pid_max)
> > return ERR_PTR(-EINVAL);
> > /* Also fail if a PID != 1 is requested and no PID 1 exists */
> > if (xxx != 1 && !tmp->child_reaper)
> > return ERR_PTR(-EINVAL);
> > if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> > return ERR_PTR(-EPERM);
> > set_tid_size--;
> > }
> >
> > idr_preload(GFP_KERNEL);
> > spin_lock_irq(&pidmap_lock);
> >
> > if (xxx) {
> > nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
> > GFP_ATOMIC);
> > /*
> > * If ENOSPC is returned it means that the PID is
> > * alreay in use. Return EEXIST in that case.
> > */
> > if (nr == -ENOSPC)
> > nr = -EEXIST;
> > } else {
> > int pid_min = 1;
> > /*
> > * init really needs pid 1, but after reaching the
> > * maximum wrap back to RESERVED_PIDS
> > */
> > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> > pid_min = RESERVED_PIDS;
> > /*
> > * Store a null pointer so find_pid_ns does not find
> > * a partially initialized PID (see below).
> > */
> > nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> > pid_max, GFP_ATOMIC);
> > }
> >
> > ...
> >
> > This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
> >
> > note also that this way we can easily allow set_tid[some_level] == 0, we can
> > simply do
> >
> > - if (xxx < 1 || xxx >= pid_max)
> > + if (xxx < 0 || xxx >= pid_max)
> >
> > although I don't think this is really useful.
>
> Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
> be useful (or maybe even valid).
How do you express: I don't care about a specific pid in pidns level
<n>, just give me a random one? For example,
set_tid[0] = 1234
set_tid[1] = 5678
set_tid[2] = random_pid()
set_tid[3] = 9
Wouldn't that be potentially useful?
Christian
On 11/11, Christian Brauner wrote:
>
> On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
> > > note also that this way we can easily allow set_tid[some_level] == 0, we can
> > > simply do
> > >
> > > - if (xxx < 1 || xxx >= pid_max)
> > > + if (xxx < 0 || xxx >= pid_max)
> > >
> > > although I don't think this is really useful.
> >
> > Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
> > be useful (or maybe even valid).
>
> How do you express: I don't care about a specific pid in pidns level
> <n>,
Yes,
> Wouldn't that be potentially useful?
As I said above, I don't think this is really useful. Just I was thinking
out loud.
I suggested to cache set_tid[pos] rather than pos because this makes the
code simpler, not because this allows this allows to "naturally" handle
the case when set_tid[pos] == 0. Sorry it it was not clear.
Oleg.
On 11/11/2019 14.17, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
>
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
>
> Extending clone3() to support *set_tid makes it possible restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
>
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
>
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
>
> /*
> * Verify that higher 32bits of exit_signal are unset and that
> * it is a valid signal
> @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> .stack = args.stack,
> .stack_size = args.stack_size,
> .tls = args.tls,
> + .set_tid = kargs->set_tid,
> + .set_tid_size = args.set_tid_size,
> };
This is a bit ugly. And is it even well-defined? I mean, it's a bit
similar to the "i = i++;". So it would be best to avoid.
> + for (i = 0; i < args.set_tid_size; i++) {
> + if (copy_from_user(&kargs->set_tid[i],
> + u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> + sizeof(pid_t)))
> + return -EFAULT;
> + }
> +
If I'm reading this (and your test case) right, you expect the user
pointer to point at an array of u64, and here you're copying the first
half of each u64 to the pid_t array. That only works on little-endian.
It seems more obvious (since I don't think there's any disagreement
anywhere on sizeof(pid_t)) to expect the user pointer to point at an
array of pid_t and then simply copy_from_user() the whole thing in one go.
> return 0;
> }
>
> @@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> int err;
>
> struct kernel_clone_args kargs;
> + pid_t set_tid[MAX_PID_NS_LEVEL];
> +
> + memset(set_tid, 0, sizeof(set_tid));
> + kargs.set_tid = set_tid;
Hm, isn't it a bit much to add two cachelines (and dirtying them via the
memset) to the stack footprint of clone3, considering that almost nobody
(relatively speaking) will use this?
So how about copy_clone_args_from_user() does
if (args.set_tid) {
set_tid = memdup_user(u64_to_user_ptr(), ...)
if (IS_ERR(set_tid))
return PTR_ERR(set_tid);
kargs.set_tid = set_tid;
}
Then somebody needs to free that, but this is probably not the last
clone extension that might need extra data, so one could do
s/long _do_fork/static long __do_fork/
and then create a _do_fork that always cleans up the passed-in kargs, i.e.
long _do_fork(struct kargs *args)
{
long ret = __do_fork(args);
kfree(args->set_tid);
return ret;
}
Rasmus
Christian Brauner <[email protected]> writes:
> On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
>> On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
>> > On 11/11, Adrian Reber wrote:
>> > >
>> > > v7:
>> > > - changed set_tid to be an array to set the PID of a process
>> > > in multiple nested PID namespaces at the same time as discussed
>> > > at LPC 2019 (container MC)
>> >
>> > cough... iirc you convinced me this is not needed when we discussed
>> > the previous version ;) Nevermind, probably my memory fools me.
>>
>> You are right. You suggested the same thing and we didn't listen ;)
>>
>> > So far I only have some cosmetic nits,
>>
>> Thanks for the quick review. I will try to apply your suggestions.
>>
>> > > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>> > >
>> > > for (i = ns->level; i >= 0; i--) {
>> > > int pid_min = 1;
>> > > + int t_pos = 0;
>> > ^^^^^
>> >
>> > I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
>> > the code a bit more simple.
>> >
>> > > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>> > > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
>> > > pid_min = RESERVED_PIDS;
>> >
>> > You can probably move this code into the "else" branch below.
>> >
>> > IOW, something like
>> >
>> >
>> > for (i = ns->level; i >= 0; i--) {
>> > int xxx = 0;
>> >
>> > if (set_tid_size) {
>> > int pos = ns->level - i;
>> >
>> > xxx = set_tid[pos];
>> > if (xxx < 1 || xxx >= pid_max)
>> > return ERR_PTR(-EINVAL);
>> > /* Also fail if a PID != 1 is requested and no PID 1 exists */
>> > if (xxx != 1 && !tmp->child_reaper)
>> > return ERR_PTR(-EINVAL);
>> > if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
>> > return ERR_PTR(-EPERM);
>> > set_tid_size--;
>> > }
>> >
>> > idr_preload(GFP_KERNEL);
>> > spin_lock_irq(&pidmap_lock);
>> >
>> > if (xxx) {
>> > nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
>> > GFP_ATOMIC);
>> > /*
>> > * If ENOSPC is returned it means that the PID is
>> > * alreay in use. Return EEXIST in that case.
>> > */
>> > if (nr == -ENOSPC)
>> > nr = -EEXIST;
>> > } else {
>> > int pid_min = 1;
>> > /*
>> > * init really needs pid 1, but after reaching the
>> > * maximum wrap back to RESERVED_PIDS
>> > */
>> > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
>> > pid_min = RESERVED_PIDS;
>> > /*
>> > * Store a null pointer so find_pid_ns does not find
>> > * a partially initialized PID (see below).
>> > */
>> > nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
>> > pid_max, GFP_ATOMIC);
>> > }
>> >
>> > ...
>> >
>> > This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
>> >
>> > note also that this way we can easily allow set_tid[some_level] == 0, we can
>> > simply do
>> >
>> > - if (xxx < 1 || xxx >= pid_max)
>> > + if (xxx < 0 || xxx >= pid_max)
>> >
>> > although I don't think this is really useful.
>>
>> Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
>> be useful (or maybe even valid).
I agree not allowing 0 sounds very reasonable.
> How do you express: I don't care about a specific pid in pidns level
> <n>, just give me a random one? For example,
>
> set_tid[0] = 1234
> set_tid[1] = 5678
> set_tid[2] = random_pid()
> set_tid[3] = 9
>
> Wouldn't that be potentially useful?
I can't imagine how.
At least in my head the fundamental concept is picking up a container on
one machine and moving it to another machine. For that operation you
will know starting with the most nested pid namespace the pids that you
want up to some point. Farther up you don't know.
I can't imagine in what scenario you would not know a pid at outer level
but want a specified pid at an ever farther removed outer level. What
scenario are you thinking about that could lead to such a situation?
For the me the question is: Are you restoring what you know or not?
Eric
On Mon, Nov 11, 2019 at 05:08:57PM -0600, Eric W. Biederman wrote:
> Christian Brauner <[email protected]> writes:
>
> > On Mon, Nov 11, 2019 at 04:40:28PM +0100, Adrian Reber wrote:
> >> On Mon, Nov 11, 2019 at 04:25:15PM +0100, Oleg Nesterov wrote:
> >> > On 11/11, Adrian Reber wrote:
> >> > >
> >> > > v7:
> >> > > - changed set_tid to be an array to set the PID of a process
> >> > > in multiple nested PID namespaces at the same time as discussed
> >> > > at LPC 2019 (container MC)
> >> >
> >> > cough... iirc you convinced me this is not needed when we discussed
> >> > the previous version ;) Nevermind, probably my memory fools me.
> >>
> >> You are right. You suggested the same thing and we didn't listen ;)
> >>
> >> > So far I only have some cosmetic nits,
> >>
> >> Thanks for the quick review. I will try to apply your suggestions.
> >>
> >> > > @@ -175,6 +187,18 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >> > >
> >> > > for (i = ns->level; i >= 0; i--) {
> >> > > int pid_min = 1;
> >> > > + int t_pos = 0;
> >> > ^^^^^
> >> >
> >> > I won't insist, but I'd suggest to cache set_tid[t_pos] instead to make
> >> > the code a bit more simple.
> >> >
> >> > > @@ -186,12 +210,24 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >> > > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> >> > > pid_min = RESERVED_PIDS;
> >> >
> >> > You can probably move this code into the "else" branch below.
> >> >
> >> > IOW, something like
> >> >
> >> >
> >> > for (i = ns->level; i >= 0; i--) {
> >> > int xxx = 0;
> >> >
> >> > if (set_tid_size) {
> >> > int pos = ns->level - i;
> >> >
> >> > xxx = set_tid[pos];
> >> > if (xxx < 1 || xxx >= pid_max)
> >> > return ERR_PTR(-EINVAL);
> >> > /* Also fail if a PID != 1 is requested and no PID 1 exists */
> >> > if (xxx != 1 && !tmp->child_reaper)
> >> > return ERR_PTR(-EINVAL);
> >> > if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> >> > return ERR_PTR(-EPERM);
> >> > set_tid_size--;
> >> > }
> >> >
> >> > idr_preload(GFP_KERNEL);
> >> > spin_lock_irq(&pidmap_lock);
> >> >
> >> > if (xxx) {
> >> > nr = idr_alloc(&tmp->idr, NULL, xxx, xxx + 1,
> >> > GFP_ATOMIC);
> >> > /*
> >> > * If ENOSPC is returned it means that the PID is
> >> > * alreay in use. Return EEXIST in that case.
> >> > */
> >> > if (nr == -ENOSPC)
> >> > nr = -EEXIST;
> >> > } else {
> >> > int pid_min = 1;
> >> > /*
> >> > * init really needs pid 1, but after reaching the
> >> > * maximum wrap back to RESERVED_PIDS
> >> > */
> >> > if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> >> > pid_min = RESERVED_PIDS;
> >> > /*
> >> > * Store a null pointer so find_pid_ns does not find
> >> > * a partially initialized PID (see below).
> >> > */
> >> > nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> >> > pid_max, GFP_ATOMIC);
> >> > }
> >> >
> >> > ...
> >> >
> >> > This way only the "if (set_tid_size)" block has to play with set_tid_size/set_tid.
> >> >
> >> > note also that this way we can easily allow set_tid[some_level] == 0, we can
> >> > simply do
> >> >
> >> > - if (xxx < 1 || xxx >= pid_max)
> >> > + if (xxx < 0 || xxx >= pid_max)
> >> >
> >> > although I don't think this is really useful.
> >>
> >> Yes. I explicitly didn't allow 0 as a PID as I didn't thought it would
> >> be useful (or maybe even valid).
>
> I agree not allowing 0 sounds very reasonable.
Yeah, I think we are all in agreement here.
>
> > How do you express: I don't care about a specific pid in pidns level
> > <n>, just give me a random one? For example,
> >
> > set_tid[0] = 1234
> > set_tid[1] = 5678
> > set_tid[2] = random_pid()
> > set_tid[3] = 9
> >
> > Wouldn't that be potentially useful?
>
> I can't imagine how.
>
> At least in my head the fundamental concept is picking up a container on
> one machine and moving it to another machine. For that operation you
> will know starting with the most nested pid namespace the pids that you
> want up to some point. Farther up you don't know.
>
> I can't imagine in what scenario you would not know a pid at outer level
> but want a specified pid at an ever farther removed outer level. What
> scenario are you thinking about that could lead to such a situation?
>
> For the me the question is: Are you restoring what you know or not?
I didn't advocate for making this possible (though I can see how this
would be a neat hacking tool).
Though this whole paragraph highlights one of my concerns with this
whole feature. As it stands it is _only_ useful to CRIU. Which as I said
before is fine but it still makes me queasy when an interface really
just is designed to serve a single use-case; this specific feature even
just a single user.
I'm hopeful that we can find other use-cases for testing. It's probably
already a fun feature for making pid-reuse based kernel exploits way
easier.
Christian
On Mon, Nov 11, 2019 at 09:41:39PM +0100, Rasmus Villemoes wrote:
> On 11/11/2019 14.17, Adrian Reber wrote:
> > The main motivation to add set_tid to clone3() is CRIU.
> >
> > To restore a process with the same PID/TID CRIU currently uses
> > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > ns_last_pid and then (quickly) does a clone(). This works most of the
> > time, but it is racy. It is also slow as it requires multiple syscalls.
> >
> > Extending clone3() to support *set_tid makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> >
> > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > on clone3() with *set_tid as they are currently in place for ns_last_pid.
> >
> > The original version of this change was using a single value for
> > set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> > decided to change set_tid to an array to enable setting the PID of a
> > process in multiple PID namespaces at the same time. If a process is
> > created in a PID namespace it is possible to influence the PID inside
> > and outside of the PID namespace. Details also in the corresponding
> > selftest.
> >
>
> > /*
> > * Verify that higher 32bits of exit_signal are unset and that
> > * it is a valid signal
> > @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> > .stack = args.stack,
> > .stack_size = args.stack_size,
> > .tls = args.tls,
> > + .set_tid = kargs->set_tid,
> > + .set_tid_size = args.set_tid_size,
> > };
>
> This is a bit ugly. And is it even well-defined? I mean, it's a bit
> similar to the "i = i++;". So it would be best to avoid.
>
> > + for (i = 0; i < args.set_tid_size; i++) {
> > + if (copy_from_user(&kargs->set_tid[i],
> > + u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> > + sizeof(pid_t)))
> > + return -EFAULT;
> > + }
> > +
>
> If I'm reading this (and your test case) right, you expect the user
> pointer to point at an array of u64, and here you're copying the first
> half of each u64 to the pid_t array. That only works on little-endian.
>
> It seems more obvious (since I don't think there's any disagreement
> anywhere on sizeof(pid_t)) to expect the user pointer to point at an
> array of pid_t and then simply copy_from_user() the whole thing in one go.
>
> > return 0;
> > }
> >
> > @@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > int err;
> >
> > struct kernel_clone_args kargs;
> > + pid_t set_tid[MAX_PID_NS_LEVEL];
> > +
> > + memset(set_tid, 0, sizeof(set_tid));
> > + kargs.set_tid = set_tid;
>
> Hm, isn't it a bit much to add two cachelines (and dirtying them via the
> memset) to the stack footprint of clone3, considering that almost nobody
> (relatively speaking) will use this?
>
> So how about copy_clone_args_from_user() does
>
> if (args.set_tid) {
> set_tid = memdup_user(u64_to_user_ptr(), ...)
> if (IS_ERR(set_tid))
> return PTR_ERR(set_tid);
> kargs.set_tid = set_tid;
> }
>
> Then somebody needs to free that, but this is probably not the last
> clone extension that might need extra data, so one could do
>
> s/long _do_fork/static long __do_fork/
>
> and then create a _do_fork that always cleans up the passed-in kargs, i.e.
>
> long _do_fork(struct kargs *args)
> {
> long ret = __do_fork(args);
> kfree(args->set_tid);
> return ret;
> }
Thanks for your review. Did you had a look at what Christian suggested?
That should solve most of the points you mentioned. I will also remove
the memset() as it is not necessary at all.
Adrian
On Mon, Nov 11, 2019 at 09:41:39PM +0100, Rasmus Villemoes wrote:
> On 11/11/2019 14.17, Adrian Reber wrote:
> > The main motivation to add set_tid to clone3() is CRIU.
> >
> > To restore a process with the same PID/TID CRIU currently uses
> > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > ns_last_pid and then (quickly) does a clone(). This works most of the
> > time, but it is racy. It is also slow as it requires multiple syscalls.
> >
> > Extending clone3() to support *set_tid makes it possible restore a
> > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> > race free (as long as the desired PID/TID is available).
> >
> > This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> > on clone3() with *set_tid as they are currently in place for ns_last_pid.
> >
> > The original version of this change was using a single value for
> > set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> > decided to change set_tid to an array to enable setting the PID of a
> > process in multiple PID namespaces at the same time. If a process is
> > created in a PID namespace it is possible to influence the PID inside
> > and outside of the PID namespace. Details also in the corresponding
> > selftest.
> >
>
> > /*
> > * Verify that higher 32bits of exit_signal are unset and that
> > * it is a valid signal
> > @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> > .stack = args.stack,
> > .stack_size = args.stack_size,
> > .tls = args.tls,
> > + .set_tid = kargs->set_tid,
> > + .set_tid_size = args.set_tid_size,
> > };
>
> This is a bit ugly. And is it even well-defined? I mean, it's a bit
> similar to the "i = i++;". So it would be best to avoid.
>
> > + for (i = 0; i < args.set_tid_size; i++) {
> > + if (copy_from_user(&kargs->set_tid[i],
> > + u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> > + sizeof(pid_t)))
> > + return -EFAULT;
> > + }
> > +
>
> If I'm reading this (and your test case) right, you expect the user
> pointer to point at an array of u64, and here you're copying the first
> half of each u64 to the pid_t array. That only works on little-endian.
>
> It seems more obvious (since I don't think there's any disagreement
> anywhere on sizeof(pid_t)) to expect the user pointer to point at an
> array of pid_t and then simply copy_from_user() the whole thing in one go.
Yes, that was wrong. I changed the test case to use an array of pid_t.
Adrian