2009-10-13 04:48:58

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call


Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed. When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone3() that lets a process
specify the pids of the child process.

Patches 1 through 7 are helper patches, needed for choosing a pid for the
child process.

PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.

Changelog[v8]:
- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
The name 'clone2()' is in use - renamed new syscall to clone3().
- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
(Added [PATCH 7/10] to the patchset).

Changelog[v7]:
- [Peter Zijlstra, Arnd Bergmann]
Group the arguments to clone2() into a 'struct clone_arg' to
workaround the issue of exceeding 6 arguments to the system call.
Also define clone-flags as u64 to allow additional clone-flags.

Changelog[v6]:
- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
constant across architectures (Patches 7, 8).
- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
'unum_pids < 0' check (Patches 7,8)
- (Pavel Machek) New patch (Patch 9) to add some documentation.

Changelog[v5]:
- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
into this set)
- (Eric Biederman): Avoid the new function, set_pidmap() - added
couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT TODO:

clone() system call has another limitation - all but one, available bits in
clone-flags are in use and if more new clone-flags are needed, we will need
a variant of the clone() system call.

It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:

- do everything clone() does today, and
- give application an ability to choose pids for the child process
in all ancestor pid namespaces, and
- allow more clone_flags

Contstraints:

- system-calls are restricted to 6 parameters and clone() already
takes 5 parameters, any extension to clone() interface would require
one or more copy_from_user(). (Not sure if copy_from_user() of ~40
bytes would have a significant impact on performance of clone()).

Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset). Based on input from
Arnd Bergmann and others, the new interface of the system call is:

struct clone_struct {
u64 flags;
u64 child_stack;
u32 nr_pids;
u32 reserved1;
u64 parent_tid;
u64 child_tid;
u64 reserved2;
};

sys_clone3(struct clone_struct __user *cs, pid_t __user *pids)

Signed-off-by: Sukadev Bhattiprolu <[email protected]>


2009-10-13 04:49:27

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 1/10]: Factor out code to allocate pidmap page



Subject: [RFC][v8][PATCH 1/10]: Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
- Earlier version of patchset called alloc_pidmap_page() from two
places. But now its called from only one place. Even so, moving
this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
-ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/pid.c | 45 ++++++++++++++++++++++++++++++---------------
1 file changed, 30 insertions(+), 15 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c 2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid.c 2009-09-09 19:06:25.000000000 -0700
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upi
atomic_inc(&map->nr_free);
}

+static int alloc_pidmap_page(struct pidmap *map)
+{
+ void *page;
+
+ if (likely(map->page))
+ return 0;
+
+ page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ /*
+ * Free the page if someone raced with us installing it:
+ */
+ spin_lock_irq(&pidmap_lock);
+ if (map->page)
+ kfree(page);
+ else
+ map->page = page;
+ spin_unlock_irq(&pidmap_lock);
+
+ if (unlikely(!map->page))
+ return -ENOMEM;
+
+ return 0;
+}
+
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int rc;
struct pidmap *map;

pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_names
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
- if (unlikely(!map->page)) {
- void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
- /*
- * Free the page if someone raced with us
- * installing it:
- */
- spin_lock_irq(&pidmap_lock);
- if (map->page)
- kfree(page);
- else
- map->page = page;
- spin_unlock_irq(&pidmap_lock);
- if (unlikely(!map->page))
- break;
- }
+ rc = alloc_pidmap_page(map);
+ if (rc)
+ break;
+
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {

2009-10-13 04:49:49

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 2/10]: Have alloc_pidmap() return actual error code



Subject: [RFC][v8][PATCH 2/10]: Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed. With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/fork.c | 5 +++--
kernel/pid.c | 14 +++++++++-----
2 files changed, 12 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/fork.c 2009-09-09 19:06:46.000000000 -0700
@@ -1110,10 +1110,11 @@ static struct task_struct *copy_process(
goto bad_fork_cleanup_io;

if (pid != &init_struct_pid) {
- retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
- if (!pid)
+ if (IS_ERR(pid)) {
+ retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+ }

if (clone_flags & CLONE_NEWPID) {
retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c 2009-09-09 19:06:25.000000000 -0700
+++ linux-2.6/kernel/pid.c 2009-09-09 19:06:46.000000000 -0700
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidm
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
- int rc;
+ int rc = -EAGAIN;
struct pidmap *map;

pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_names
} else {
map = &pid_ns->pidmap[0];
offset = RESERVED_PIDS;
- if (unlikely(last == offset))
+ if (unlikely(last == offset)) {
+ rc = -EAGAIN;
break;
+ }
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return rc;
}

int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespa
struct upid *upid;

pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
- if (!pid)
+ if (!pid) {
+ pid = ERR_PTR(-ENOMEM);
goto out;
+ }

tmp = ns;
for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
free_pidmap(pid->numbers + i);

kmem_cache_free(ns->pid_cachep, pid);
- pid = NULL;
+ pid = ERR_PTR(nr);
goto out;
}

2009-10-13 04:50:11

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property



From: Serge Hallyn <[email protected]>
Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

Remove the pid_max global, and make it a property of the
pid_namespace. When a pid_ns is created, it inherits
the parent's pid_ns.

Fixing up sysctl (trivial akin to ipc version, but
potentially tedious to get right for all CONFIG*
combinations) is left for later.

Changelog[v2]:
- Port to newer kernel
- Make pid_max a local variable in alloc_pidmap() to simplify code/patch

Signed-off-by: Serge Hallyn <[email protected]>
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
include/linux/pid_namespace.h | 1 +
kernel/pid.c | 4 ++--
kernel/pid_namespace.c | 1 +
kernel/sysctl.c | 4 ++--
4 files changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid_namespace.h
===================================================================
--- linux-2.6.orig/include/linux/pid_namespace.h 2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/include/linux/pid_namespace.h 2009-09-09 19:07:20.000000000 -0700
@@ -30,6 +30,7 @@ struct pid_namespace {
#ifdef CONFIG_BSD_PROCESS_ACCT
struct bsd_acct_struct *bacct;
#endif
+ int pid_max;
};

extern struct pid_namespace init_pid_ns;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c 2009-09-09 19:06:46.000000000 -0700
+++ linux-2.6/kernel/pid.c 2009-09-09 19:07:20.000000000 -0700
@@ -43,8 +43,6 @@ static struct hlist_head *pid_hash;
static int pidhash_shift;
struct pid init_struct_pid = INIT_STRUCT_PID;

-int pid_max = PID_MAX_DEFAULT;
-
#define RESERVED_PIDS 300

int pid_max_min = RESERVED_PIDS + 1;
@@ -78,6 +76,7 @@ struct pid_namespace init_pid_ns = {
.last_pid = 0,
.level = 0,
.child_reaper = &init_task,
+ .pid_max = PID_MAX_DEFAULT,
};
EXPORT_SYMBOL_GPL(init_pid_ns);

@@ -151,6 +150,7 @@ static int alloc_pidmap(struct pid_names
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
int rc = -EAGAIN;
+ int pid_max = pid_ns->pid_max;
struct pidmap *map;

pid = last + 1;
Index: linux-2.6/kernel/pid_namespace.c
===================================================================
--- linux-2.6.orig/kernel/pid_namespace.c 2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid_namespace.c 2009-09-09 19:07:20.000000000 -0700
@@ -87,6 +87,7 @@ static struct pid_namespace *create_pid_

kref_init(&ns->kref);
ns->level = level;
+ ns->pid_max = parent_pid_ns->pid_max;
ns->parent = get_pid_ns(parent_pid_ns);

set_bit(0, ns->pidmap[0].page);
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c 2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/sysctl.c 2009-09-09 19:07:20.000000000 -0700
@@ -55,6 +55,7 @@

#include <asm/uaccess.h>
#include <asm/processor.h>
+#include <linux/pid_namespace.h>

#ifdef CONFIG_X86
#include <asm/nmi.h>
@@ -78,7 +79,6 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
-extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
@@ -670,7 +670,7 @@ static struct ctl_table kern_table[] = {
{
.ctl_name = KERN_PIDMAX,
.procname = "pid_max",
- .data = &pid_max,
+ .data = &init_pid_ns.pid_max,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,

2009-10-13 04:50:33

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()



Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()

With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.

Changelog[v6]:
- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
- (Eric Biederman): Avoid set_pidmap() function. Added couple of
checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
---
kernel/pid.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c 2009-09-09 19:07:20.000000000 -0700
+++ linux-2.6/kernel/pid.c 2009-09-10 10:23:27.000000000 -0700
@@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
return 0;
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
int rc = -EAGAIN;
int pid_max = pid_ns->pid_max;
struct pidmap *map;

- pid = last + 1;
- if (pid >= pid_max)
- pid = RESERVED_PIDS;
+ if (target_pid) {
+ pid = target_pid;
+ if (pid < 0 || pid >= pid_max)
+ return -EINVAL;
+ } else {
+ pid = last + 1;
+ if (pid >= pid_max)
+ pid = RESERVED_PIDS;
+ }
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
@@ -164,6 +170,15 @@ static int alloc_pidmap(struct pid_names
if (rc)
break;

+ if (target_pid) {
+ rc = -EBUSY;
+ if (!test_and_set_bit(offset, map->page)) {
+ atomic_dec(&map->nr_free);
+ rc = pid;
+ }
+ return rc;
+ }
+
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
@@ -196,6 +211,7 @@ static int alloc_pidmap(struct pid_names
}
pid = mk_pid(pid_ns, map, offset);
}
+
return rc;
}

@@ -272,7 +288,7 @@ struct pid *alloc_pid(struct pid_namespa

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ nr = alloc_pidmap(tmp, 0);
if (nr < 0)
goto out_free;

2009-10-13 04:51:11

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 5/10]: Add target_pids parameter to alloc_pid()



Subject: [RFC][v8][PATCH 5/10]: Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
include/linux/pid.h | 2 +-
kernel/fork.c | 3 ++-
kernel/pid.c | 9 +++++++--
3 files changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid.h
===================================================================
--- linux-2.6.orig/include/linux/pid.h 2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/include/linux/pid.h 2009-09-10 10:29:10.000000000 -0700
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
int next_pidmap(struct pid_namespace *pid_ns, int last);

-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
extern void free_pid(struct pid *pid);

/*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/kernel/fork.c 2009-09-10 10:29:10.000000000 -0700
@@ -940,6 +940,7 @@ static struct task_struct *copy_process(
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+ pid_t *target_pids = NULL;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1110,7 +1111,7 @@ static struct task_struct *copy_process(
goto bad_fork_cleanup_io;

if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns);
+ pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c 2009-09-10 10:23:27.000000000 -0700
+++ linux-2.6/kernel/pid.c 2009-09-10 10:29:10.000000000 -0700
@@ -272,13 +272,14 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
{
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+ int tpid;

pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid) {
@@ -288,7 +289,11 @@ struct pid *alloc_pid(struct pid_namespa

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp, 0);
+ tpid = 0;
+ if (target_pids)
+ tpid = target_pids[i];
+
+ nr = alloc_pidmap(tmp, tpid);
if (nr < 0)
goto out_free;

2009-10-13 04:51:43

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 6/10]: Add target_pids parameter to copy_process()



Subject: [RFC][v8][PATCH 6/10]: Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process(). The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/fork.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2009-09-10 10:29:10.000000000 -0700
+++ linux-2.6/kernel/fork.c 2009-09-10 10:29:13.000000000 -0700
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
+ pid_t *target_pids,
int trace)
{
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
- pid_t *target_pids = NULL;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle
struct pt_regs regs;

task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
- &init_struct_pid, 0);
+ &init_struct_pid, NULL, 0);
if (!IS_ERR(task))
init_idle(task, cpu);

@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
+ pid_t *target_pids = NULL;

/*
* Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
trace = tracehook_prepare_clone(clone_flags);

p = copy_process(clone_flags, stack_start, regs, stack_size,
- child_tidptr, NULL, trace);
+ child_tidptr, NULL, target_pids, trace);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.

2009-10-13 04:52:04

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags



Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future. To ensure this, define
a mask of clone-flags and check the flags in the clone() system calls.

Changelog[v8]:
- New patch in set

Signed-off-by: Sukadev Bhattiprolu <[email protected]>

---
include/linux/sched.h | 10 ++++++++++
kernel/fork.c | 3 +++
2 files changed, 13 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2009-10-02 18:53:55.000000000 -0700
+++ linux-2.6/include/linux/sched.h 2009-10-02 19:58:21.000000000 -0700
@@ -29,6 +29,16 @@
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */

+#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+ CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK |\
+ CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS |\
+ CLONE_SYSVSEM | CLONE_SETTLS |\
+ CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\
+ CLONE_DETACHED | CLONE_UNTRACED |\
+ CLONE_CHILD_SETTID | CLONE_STOPPED |\
+ CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+ CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
+
/*
* Scheduling policies
*/
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2009-10-02 19:00:08.000000000 -0700
+++ linux-2.6/kernel/fork.c 2009-10-02 19:57:36.000000000 -0700
@@ -942,6 +942,9 @@ static struct task_struct *copy_process(
struct task_struct *p;
int cgroup_callbacks_done = 0;

+ if (clone_flags & ~VALID_CLONE_FLAGS)
+ return ERR_PTR(-EINVAL);
+
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);

2009-10-13 04:52:22

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 8/10]: Define do_fork_with_pids()



Subject: [RFC][v8][PATCH 8/10]: Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
instead of 'struct pid_set *'.

Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
is constant across architectures.
- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
- Rename 'struct target_pid_set' to 'struct pid_set' since it may
be useful in other contexts.

Changelog[v3]:
- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
- To facilitate moving architecture-inpdendent code to kernel/fork.c
pass in 'struct target_pid_set __user *' to do_fork_with_pids()
rather than 'pid_t *' (next patch moves the arch-independent
code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/fork.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2009-10-12 18:58:50.000000000 -0700
+++ linux-2.6/include/linux/sched.h 2009-10-12 18:59:04.000000000 -0700
@@ -2064,6 +2064,9 @@ extern int disallow_signal(int);

extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+ unsigned long, int __user *, int __user *,
+ unsigned int, pid_t __user *);
struct task_struct *fork_idle(int);

extern void set_task_comm(struct task_struct *tsk, char *from);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2009-10-12 18:58:50.000000000 -0700
+++ linux-2.6/kernel/fork.c 2009-10-12 19:13:24.000000000 -0700
@@ -1335,12 +1335,14 @@ struct task_struct * __cpuinit fork_idle
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*/
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ unsigned int num_pids,
+ pid_t __user *upids)
{
struct task_struct *p;
int trace = 0;
@@ -1443,6 +1445,17 @@ long do_fork(unsigned long clone_flags,
return nr;
}

+long do_fork(unsigned long clone_flags,
+ unsigned long stack_start,
+ struct pt_regs *regs,
+ unsigned long stack_size,
+ int __user *parent_tidptr,
+ int __user *child_tidptr)
+{
+ return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+ parent_tidptr, child_tidptr, 0, NULL);
+}
+
#ifndef ARCH_MIN_MMSTRUCT_ALIGN
#define ARCH_MIN_MMSTRUCT_ALIGN 0
#endif

2009-10-13 04:54:12

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall



Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone3(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all ?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone3()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/clone3 in next patch for more details of clone3() and an
example of its usage.

NOTE:
System calls are restricted to 6 parameters and the number and sizes
of parameters needed for sys_clone3() exceed 6 integers. The new
prototype works around this restriction while providing some
flexibility if clone3() needs to be further extended in the future.
TODO:
- May need additional sanity checks in do_fork_with_pids().

Changelog[v8]
- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
must be 64-bit.
- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
is constant across architectures.
- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
'unum_pids < 0' check.

Changelog[v4]:
- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
in the target_pids[] list and setting it 0. See copy_target_pids()).
- (Oren Laadan) Specified target pids should apply only to youngest
pid-namespaces (see copy_target_pids())
- (Matt Helsley) Update patch description.

Changelog[v2]:
- Remove unnecessary printk and add a note to callers of
copy_target_pids() to free target_pids.
- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
'num_pids == 0' (fall back to normal clone()).
- Move arch-independent code (sanity checks and copy-in of target-pids)
into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my
git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
arch/x86/include/asm/syscalls.h | 1
arch/x86/include/asm/unistd_32.h | 1
arch/x86/kernel/process_32.c | 34 +++++++++++++
arch/x86/kernel/syscall_table_32.S | 1
include/linux/types.h | 10 +++
kernel/fork.c | 96 ++++++++++++++++++++++++++++++++++++-
6 files changed, 142 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h 2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/include/linux/types.h 2009-10-12 19:15:08.000000000 -0700
@@ -204,6 +204,16 @@ struct ustat {
char f_fpack[6];
};

+struct clone_struct {
+ u64 flags;
+ u64 child_stack;
+ u32 nr_pids;
+ u32 reserved1;
+ u64 parent_tid;
+ u64 child_tid;
+ u64 reserved2;
+};
+
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
#endif /* _LINUX_TYPES_H */
Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h 2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h 2009-10-12 19:15:08.000000000 -0700
@@ -55,6 +55,7 @@ struct sel_arg_struct;
struct oldold_utsname;
struct old_utsname;

+asmlinkage long sys_clone3(struct clone_struct __user *cs, pid_t *pids);
asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
unsigned long, unsigned long, unsigned long);
asmlinkage int old_mmap(struct mmap_arg_struct __user *);
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h 2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h 2009-10-12 19:15:08.000000000 -0700
@@ -342,6 +342,7 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_counter_open 336
+#define __NR_clone3 337

#ifdef __KERNEL__

Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c 2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c 2009-10-12 19:54:34.000000000 -0700
@@ -443,6 +443,40 @@ int sys_clone(struct pt_regs *regs)
return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
}

+asmlinkage long
+sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
+{
+ int rc;
+ struct clone_struct kcs;
+ unsigned long clone_flags;
+ int __user *parent_tid_ptr;
+ int __user *child_tid_ptr;
+ unsigned long __user child_stack_base;
+ struct pt_regs *regs;
+
+ rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+ if (rc)
+ return -EFAULT;
+
+ if (kcs.reserved1 || kcs.reserved2)
+ return -EINVAL;
+
+ /*
+ * TODO: Convert clone_flags to 64-bit
+ */
+ clone_flags = (unsigned long)kcs.flags;
+ child_stack_base = (unsigned long)kcs.child_stack;
+ parent_tid_ptr = (int *)ucs->parent_tid;
+ child_tid_ptr = (int *)ucs->child_tid;
+ regs = task_pt_regs(current);
+
+ if (!child_stack_base)
+ child_stack_base = user_stack_pointer(regs);
+
+ return do_fork_with_pids(clone_flags, child_stack_base, regs, 0,
+ parent_tid_ptr, child_tid_ptr, kcs.nr_pids, pids);
+}
+
/*
* sys_execve() executes a new program.
*/
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S 2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S 2009-10-12 19:15:08.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_counter_open
+ .long sys_clone3
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2009-10-12 19:13:24.000000000 -0700
+++ linux-2.6/kernel/fork.c 2009-10-12 19:15:08.000000000 -0700
@@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
}

/*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+ int j;
+ int rc;
+ int size;
+ int knum_pids; /* # of pids needed in kernel */
+ pid_t *target_pids;
+
+ if (!unum_pids)
+ return NULL;
+
+ knum_pids = task_pid(current)->level + 1;
+ if (unum_pids > knum_pids)
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+ * and set it to 0. This last entry in target_pids[] corresponds to the
+ * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+ * specified. If CLONE_NEWPID was not specified, this last entry will
+ * simply be ignored.
+ */
+ target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+ if (!target_pids)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * A process running in a level 2 pid namespace has three pid namespaces
+ * and hence three pid numbers. If this process is checkpointed,
+ * information about these three namespaces are saved. We refer to these
+ * namespaces as 'known namespaces'.
+ *
+ * If this checkpointed process is however restarted in a level 3 pid
+ * namespace, the restarted process has an extra ancestor pid namespace
+ * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+ *
+ * During restart, the process requests specific pids for its 'known
+ * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+ *
+ * Since the requested-pids correspond to 'known namespaces' and since
+ * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+ * namespaces', copy requested pids to the back-end of target_pids[]
+ * (i.e before the last entry for CLONE_NEWPID mentioned above).
+ * Any entries in target_pids[] not corresponding to a requested pid
+ * will be set to zero and kernel assigns a pid in those namespaces.
+ *
+ * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+ * youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+ * the order is:
+ *
+ * - pids for 'unknown-namespaces' (if any)
+ * - pids for 'known-namespaces' (requested pids)
+ * - 0 in the last entry (for CLONE_NEWPID).
+ */
+ j = knum_pids - unum_pids;
+ size = unum_pids * sizeof(pid_t);
+
+ rc = copy_from_user(&target_pids[j], upids, size);
+ if (rc) {
+ rc = -EFAULT;
+ goto out_free;
+ }
+
+ return target_pids;
+
+out_free:
+ kfree(target_pids);
+ return ERR_PTR(rc);
+}
+
+/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
@@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
struct task_struct *p;
int trace = 0;
long nr;
- pid_t *target_pids = NULL;
+ pid_t *target_pids;

/*
* Do some preliminary argument and permissions checking before we
@@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
}
}

+ target_pids = copy_target_pids(num_pids, upids);
+ if (target_pids) {
+ if (IS_ERR(target_pids))
+ return PTR_ERR(target_pids);
+
+ nr = -EPERM;
+ if (!capable(CAP_SYS_ADMIN))
+ goto out_free;
+ }
+
/*
* When called from kernel_thread, don't do user tracing stuff.
*/
@@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
} else {
nr = PTR_ERR(p);
}
+
+out_free:
+ kfree(target_pids);
+
return nr;
}

2009-10-13 04:55:35

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [RFC][v8][PATCH 10/10]: Document clone3() syscall



Subject: [RFC][v8][PATCH 10/10]: Document clone3() syscall

This gives a brief overview of the clone3() system call. We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v8]:
- clone2() is already in use in IA64. Rename syscall to clone3()
- Add notes to say that we return -EINVAL if invalid clone flags
are specified or if the reserved fields are not 0.
Changelog[v7]:
- Rename clone_with_pids() to clone2()
- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
Documentation/clone2 | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)

Index: linux-2.6/Documentation/clone2
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/clone2 2009-10-12 19:54:38.000000000 -0700
@@ -0,0 +1,89 @@
+
+struct clone_struct {
+ u64 flags;
+ u64 child_stack;
+ u32 nr_pids;
+ u32 reserved1;
+ u64 parent_tid;
+ u64 child_tid;
+ u64 reserved2;
+};
+
+clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
+
+ In addition to doing everything that clone() system call does,
+ the clone3() system call:
+
+ - allows additional clone flags (all 32 bits in the flags
+ parameter to clone() are in use)
+
+ - allows user to specify a pid for the child process in its
+ active and ancestor pid name spaces.
+
+ This system call is meant to be used when restarting an application
+ from a checkpoint. Such restart requires that the processes in the
+ application have the same pids they had when the application was
+ checkpointed. When containers are nested, the processes within the
+ containers exist in multiple pid namespaces and hence have multiple
+ pids to specify during restart.
+
+ The @pids defines the set of pids that should be assigned to the child
+ process in its active and ancestor pid name spaces. The descendant pid
+ namespaces do not matter since a process does not have a pid in
+ descendant namespaces, unless the process is in a new pid namespace
+ in which case the process is a container-init (and must have the pid 1
+ in that namespace).
+
+ See CLONE_NEWPID section of clone(2) man page for details about pid
+ namespaces.
+
+ The order pids in @pids corresponds to the nesting order of pid-
+ namespaces, with @pids[0] corresponding to the init_pid_ns.
+
+ If a pid in the @pids list is 0, the kernel will assign the next
+ available pid in the pid namespace, for the process.
+
+ If a pid in the @pids list is non-zero, the kernel tries to assign
+ the specified pid in that namespace. If that pid is already in use
+ by another process, the system call fails with -EBUSY.
+
+ On success, the system call returns the pid of the child process in
+ the parent's active pid namespace.
+
+ On failure, clone3() returns -1 and sets 'errno' to one of following
+ values (the child process is not created).
+
+ EPERM Caller does not have the SYS_ADMIN privilege needed to excute
+ this call.
+
+ EINVAL The number of pids specified in 'clone_struct.nr_pids' exceeds
+ the current nesting level of parent process
+
+ EINVAL Not all specified clone-flags are valid.
+
+ EINVAL The reserved fields in the clone_struct argument are not 0.
+
+ EBUSY A requested pid is in use by another process in that name space.
+
+Example:
+
+ pid_t pids[] = { 77, 99 };
+ struct clone_struct cs;
+
+ cs.flags = (u64) SIGCHLD;
+ cs.child_stack = (u64) setup_child_stack();
+ cs.nr_pids = 2;
+ cs.parent_tid = 0LL;
+ cs.child_tid = 0LL;
+
+ rc = syscall(__NR_clone3, &cs, pids);
+
+ if (rc < 0) {
+ perror("clone3()");
+ exit(1);
+ } else if (rc) {
+ /* Parent */
+ } else {
+ /* Child */
+ }
+

2009-10-13 05:20:36

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

On Mon, Oct 12, 2009 at 09:50:41PM -0700, Sukadev Bhattiprolu wrote:
> From: Serge Hallyn <[email protected]>
> Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
>
> Remove the pid_max global, and make it a property of the
> pid_namespace. When a pid_ns is created, it inherits
> the parent's pid_ns.
>
> Fixing up sysctl (trivial akin to ipc version, but
> potentially tedious to get right for all CONFIG*
> combinations) is left for later.

pid_max is not pid_ns property, it's just sysctl to limit cpu time
during bitmap search. Just like vm.max_map_count limits kernel memory
used by VMAs.

2009-10-13 11:57:35

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()

Sukadev Bhattiprolu wrote:
>
> Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
>

[ snip ]

> @@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
> return 0;
> }
>
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)

Please no! Better create another function, that will just
atomic_test_and_set() the proper bit in the map. This one
is heavy enough already.

2009-10-13 13:14:07

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

Sukadev Bhattiprolu wrote:
>
> From: Serge Hallyn <[email protected]>
> Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
>
> Remove the pid_max global, and make it a property of the
> pid_namespace. When a pid_ns is created, it inherits
> the parent's pid_ns.
>
> Fixing up sysctl (trivial akin to ipc version, but
> potentially tedious to get right for all CONFIG*
> combinations) is left for later.
>
> Changelog[v2]:
> - Port to newer kernel
> - Make pid_max a local variable in alloc_pidmap() to simplify code/patch
>
> Signed-off-by: Serge Hallyn <[email protected]>
> Signed-off-by: Sukadev Bhattiprolu <[email protected]>

Not that I'm about to slow down or block the process, but...
frankly I don't see the reason for doing so. Why should we?
Especially taking into account, that we essentially cannot
change thin in the namespace level 3 and deeper?

2009-10-13 16:09:14

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

Quoting Pavel Emelyanov ([email protected]):
> Sukadev Bhattiprolu wrote:
> >
> > From: Serge Hallyn <[email protected]>
> > Subject: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property
> >
> > Remove the pid_max global, and make it a property of the
> > pid_namespace. When a pid_ns is created, it inherits
> > the parent's pid_ns.
> >
> > Fixing up sysctl (trivial akin to ipc version, but
> > potentially tedious to get right for all CONFIG*
> > combinations) is left for later.
> >
> > Changelog[v2]:
> > - Port to newer kernel
> > - Make pid_max a local variable in alloc_pidmap() to simplify code/patch
> >
> > Signed-off-by: Serge Hallyn <[email protected]>
> > Signed-off-by: Sukadev Bhattiprolu <[email protected]>
>
> Not that I'm about to slow down or block the process, but...

This patch isn't a core part of the clone_with_pid functionality,
just something Eric has asked for. So I don't object to dropping
it. But I disagree with Alexey's claim that this isn't a namespace
property. It should be.

> frankly I don't see the reason for doing so. Why should we?
> Especially taking into account, that we essentially cannot
> change thin in the namespace level 3 and deeper?

What do you mean by that? With this patchset we're not, it's
true, but we trivially can - even now, userspace can simply not
give the container CAP_SYS_ADMIN or write access to the sysctl
so they can't do any more CLONE_NEWPIDS or change the sysctl.

-serge

2009-10-13 16:15:42

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

> This patch isn't a core part of the clone_with_pid functionality,
> just something Eric has asked for. So I don't object to dropping
> it. But I disagree with Alexey's claim that this isn't a namespace
> property. It should be.

OK

>> frankly I don't see the reason for doing so. Why should we?
>> Especially taking into account, that we essentially cannot
>> change thin in the namespace level 3 and deeper?
>
> What do you mean by that? With this patchset we're not, it's
> true, but we trivially can - even now, userspace can simply not
> give the container CAP_SYS_ADMIN or write access to the sysctl
> so they can't do any more CLONE_NEWPIDS or change the sysctl.

It's a misprint - I meant "level 2 and deeper". Sysctl is
only pointing at the init_pid_ns variable.

> -serge
>

2009-10-13 16:29:06

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 3/10]: Make pid_max a pid_ns property

Quoting Pavel Emelyanov ([email protected]):
> > This patch isn't a core part of the clone_with_pid functionality,
> > just something Eric has asked for. So I don't object to dropping
> > it. But I disagree with Alexey's claim that this isn't a namespace
> > property. It should be.
>
> OK
>
> >> frankly I don't see the reason for doing so. Why should we?
> >> Especially taking into account, that we essentially cannot
> >> change thin in the namespace level 3 and deeper?
> >
> > What do you mean by that? With this patchset we're not, it's
> > true, but we trivially can - even now, userspace can simply not
> > give the container CAP_SYS_ADMIN or write access to the sysctl
> > so they can't do any more CLONE_NEWPIDS or change the sysctl.
>
> It's a misprint - I meant "level 2 and deeper". Sysctl is
> only pointing at the init_pid_ns variable.

Right, and I'm saying that's to be fixed up as with all other
containerized sysctl's. You're right that this patch doesn't
solve that problem, but you seem to be arguing that it bc it's
not done in this patch, we should act as though it can't be
done.

But again, maybe we're best off dropping this patch (sorry, Suka,
I had suggested you add it...) and focusing on the rest of the set
for now.

thanks,
-serge

2009-10-13 18:36:40

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 7/10]: Check invalid clone flags



Sukadev Bhattiprolu wrote:
>
> Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags
>
> As pointed out by Oren Laadan, we want to ensure that unused bits in the
> clone-flags remain unused and available for future. To ensure this, define
> a mask of clone-flags and check the flags in the clone() system calls.
>
> Changelog[v8]:
> - New patch in set
>
> Signed-off-by: Sukadev Bhattiprolu <[email protected]>
>
> ---
> include/linux/sched.h | 10 ++++++++++
> kernel/fork.c | 3 +++
> 2 files changed, 13 insertions(+)
>
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h 2009-10-02 18:53:55.000000000 -0700
> +++ linux-2.6/include/linux/sched.h 2009-10-02 19:58:21.000000000 -0700
> @@ -29,6 +29,16 @@
> #define CLONE_NEWNET 0x40000000 /* New network namespace */
> #define CLONE_IO 0x80000000 /* Clone io context */
>
> +#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
> + CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK |\
> + CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS |\
> + CLONE_SYSVSEM | CLONE_SETTLS |\
> + CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\
> + CLONE_DETACHED | CLONE_UNTRACED |\
> + CLONE_CHILD_SETTID | CLONE_STOPPED |\
> + CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
> + CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
> +
> /*
> * Scheduling policies
> */
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c 2009-10-02 19:00:08.000000000 -0700
> +++ linux-2.6/kernel/fork.c 2009-10-02 19:57:36.000000000 -0700
> @@ -942,6 +942,9 @@ static struct task_struct *copy_process(
> struct task_struct *p;
> int cgroup_callbacks_done = 0;
>

We can safely apply these tests to clone3(), because it is a new syscall.

However, I don't know if applying it to clone() can break existing
application that may already be (incorrectly) using invalid flags ?

Oren.

> + if (clone_flags & ~VALID_CLONE_FLAGS)
> + return ERR_PTR(-EINVAL);
> +
> if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> return ERR_PTR(-EINVAL);
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

2009-10-13 18:47:35

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall



Sukadev Bhattiprolu wrote:
>
> Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall
>
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
>
> clone3(), intended for use during restart, is the same as clone(), except
> that it takes a 'pids' paramter. This parameter lets caller choose
> specific pid numbers for the child process, in the process's active and
> ancestor pid namespaces. (Descendant pid namespaces in general don't matter
> since processes don't have pids in them anyway, but see comments in
> copy_target_pids() regarding CLONE_NEWPID).
>
> Clone2() system call also attempts to address a second limitation of the
> clone() system call. clone() is restricted to 32 clone flags and most (all ?)
> of these are in use. If a new clone flag is needed, we will be forced to
> define a new variant of the clone() system call.
>
> To prevent unprivileged processes from misusing this interface, clone3()
> currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.
>
> See Documentation/clone3 in next patch for more details of clone3() and an
> example of its usage.
>
> NOTE:
> System calls are restricted to 6 parameters and the number and sizes
> of parameters needed for sys_clone3() exceed 6 integers. The new
> prototype works around this restriction while providing some
> flexibility if clone3() needs to be further extended in the future.
> TODO:
> - May need additional sanity checks in do_fork_with_pids().
>
> Changelog[v8]
> - [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
> must be 64-bit.
> - clone2() is in use in IA64. Rename system call to clone3().
>
> Changelog[v7]:
> - [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
> and group parameters into a new 'struct clone_struct' object.
>
> Changelog[v6]:
> - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
> Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
> is constant across architectures.
> - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
> 'unum_pids < 0' check.
>
> Changelog[v4]:
> - (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
>
> Changelog[v3]:
> - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
> in the target_pids[] list and setting it 0. See copy_target_pids()).
> - (Oren Laadan) Specified target pids should apply only to youngest
> pid-namespaces (see copy_target_pids())
> - (Matt Helsley) Update patch description.
>
> Changelog[v2]:
> - Remove unnecessary printk and add a note to callers of
> copy_target_pids() to free target_pids.
> - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
> - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
> 'num_pids == 0' (fall back to normal clone()).
> - Move arch-independent code (sanity checks and copy-in of target-pids)
> into kernel/fork.c and simplify sys_clone_with_pids()
>
> Changelog[v1]:
> - Fixed some compile errors (had fixed these errors earlier in my
> git tree but had not refreshed patches before emailing them)
>
> Signed-off-by: Sukadev Bhattiprolu <[email protected]>
> ---
> arch/x86/include/asm/syscalls.h | 1
> arch/x86/include/asm/unistd_32.h | 1
> arch/x86/kernel/process_32.c | 34 +++++++++++++
> arch/x86/kernel/syscall_table_32.S | 1
> include/linux/types.h | 10 +++
> kernel/fork.c | 96 ++++++++++++++++++++++++++++++++++++-
> 6 files changed, 142 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/include/linux/types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/types.h 2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/include/linux/types.h 2009-10-12 19:15:08.000000000 -0700
> @@ -204,6 +204,16 @@ struct ustat {
> char f_fpack[6];
> };
>
> +struct clone_struct {
> + u64 flags;
> + u64 child_stack;
> + u32 nr_pids;
> + u32 reserved1;
> + u64 parent_tid;
> + u64 child_tid;
> + u64 reserved2;
> +};
> +
> #endif /* __KERNEL__ */
> #endif /* __ASSEMBLY__ */
> #endif /* _LINUX_TYPES_H */
> Index: linux-2.6/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/syscalls.h 2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/syscalls.h 2009-10-12 19:15:08.000000000 -0700
> @@ -55,6 +55,7 @@ struct sel_arg_struct;
> struct oldold_utsname;
> struct old_utsname;
>
> +asmlinkage long sys_clone3(struct clone_struct __user *cs, pid_t *pids);
> asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
> unsigned long, unsigned long, unsigned long);
> asmlinkage int old_mmap(struct mmap_arg_struct __user *);
> Index: linux-2.6/arch/x86/include/asm/unistd_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/unistd_32.h 2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/unistd_32.h 2009-10-12 19:15:08.000000000 -0700
> @@ -342,6 +342,7 @@
> #define __NR_pwritev 334
> #define __NR_rt_tgsigqueueinfo 335
> #define __NR_perf_counter_open 336
> +#define __NR_clone3 337
>
> #ifdef __KERNEL__
>
> Index: linux-2.6/arch/x86/kernel/process_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/process_32.c 2009-10-12 19:13:24.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/process_32.c 2009-10-12 19:54:34.000000000 -0700
> @@ -443,6 +443,40 @@ int sys_clone(struct pt_regs *regs)
> return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
> }
>
> +asmlinkage long
> +sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
> +{
> + int rc;
> + struct clone_struct kcs;
> + unsigned long clone_flags;
> + int __user *parent_tid_ptr;
> + int __user *child_tid_ptr;
> + unsigned long __user child_stack_base;
> + struct pt_regs *regs;
> +
> + rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> + if (rc)
> + return -EFAULT;
> +
> + if (kcs.reserved1 || kcs.reserved2)
> + return -EINVAL;
> +
> + /*
> + * TODO: Convert clone_flags to 64-bit
> + */
> + clone_flags = (unsigned long)kcs.flags;

On 32-bit arch this would dismiss upper half of the flags :(
Is there a reason not to use 64-bit ('u64' or 'long long') now ?

[...]

Oren.

2009-10-13 20:52:52

by Roland McGrath

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Some userland debugging things and so forth like to look at the clone_flags
argument, so that is kept simpler for them if it stays in a register
(i.e. its own argument) rather than a user pointer fetch for that argument.
Any problem with:

sys_clone3(unsigned long clone_flags,
struct clone_struct __user *cs, pid_t __user *pids)

?

That also has the side benefit that instead of non-ia64 users forever
asking, "Why is it clone3 when there is no clone2?" you can instead pretend
that it follows the "clone3 because it takes three arguments" convention. ;-)

Btw, IMHO "struct foo_struct" is one of the lamest naming conventions ever.
How about "struct clone_args"?

Also, if you were to replace:

u64 child_stack;

with:

u64 child_stack_base;
u64 child_stack_size;

and use in sys_clone3 (for most arch's):

child_stack_ptr = kcs.child_stack_base + kcs.child_stack_size;

then the same clone3 interface would cover ia64 as well.


Thanks,
Roland

2009-10-13 23:27:01

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Roland McGrath [[email protected]] wrote:
| Some userland debugging things and so forth like to look at the clone_flags
| argument, so that is kept simpler for them if it stays in a register
| (i.e. its own argument) rather than a user pointer fetch for that argument.
| Any problem with:
|
| sys_clone3(unsigned long clone_flags,
| struct clone_struct __user *cs, pid_t __user *pids)
|
| ?

My only concern is the support of 64-bit clone flags on 32-bit architectures.
We would need an additional register/parameter, clone_flags_high ? Also,
hopefully we won't need more than 64 flags, but if we do, the plan AFACIT,
was that we could use one of the reserved fields.

|
| That also has the side benefit that instead of non-ia64 users forever
| asking, "Why is it clone3 when there is no clone2?" you can instead pretend
| that it follows the "clone3 because it takes three arguments" convention. ;-)
|
| Btw, IMHO "struct foo_struct" is one of the lamest naming conventions ever.
| How about "struct clone_args"?

Sure :-) In earlier version of patches I had mixed up clone_struct and
clone_arg in comments/descriptions and cleaned up in this version.
|
| Also, if you were to replace:
|
| u64 child_stack;
|
| with:
|
| u64 child_stack_base;
| u64 child_stack_size;
|
| and use in sys_clone3 (for most arch's):
|
| child_stack_ptr = kcs.child_stack_base + kcs.child_stack_size;
|
| then the same clone3 interface would cover ia64 as well.

Ok

|
|
| Thanks,
| Roland

2009-10-13 23:38:12

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 7/10]: Check invalid clone flags

Oren Laadan [[email protected]] wrote:
|
|
| Sukadev Bhattiprolu wrote:
| >
| > Subject: [RFC][v8][PATCH 7/10]: Check invalid clone flags
| >
| > As pointed out by Oren Laadan, we want to ensure that unused bits in the
| > clone-flags remain unused and available for future. To ensure this, define
| > a mask of clone-flags and check the flags in the clone() system calls.
| >
| > Changelog[v8]:
| > - New patch in set
| >
| > Signed-off-by: Sukadev Bhattiprolu <[email protected]>
| >
| > ---
| > include/linux/sched.h | 10 ++++++++++
| > kernel/fork.c | 3 +++
| > 2 files changed, 13 insertions(+)
| >
| > Index: linux-2.6/include/linux/sched.h
| > ===================================================================
| > --- linux-2.6.orig/include/linux/sched.h 2009-10-02 18:53:55.000000000 -0700
| > +++ linux-2.6/include/linux/sched.h 2009-10-02 19:58:21.000000000 -0700
| > @@ -29,6 +29,16 @@
| > #define CLONE_NEWNET 0x40000000 /* New network namespace */
| > #define CLONE_IO 0x80000000 /* Clone io context */
| >
| > +#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
| > + CLONE_SIGHAND | CLONE_PTRACE | CLONE_VFORK |\
| > + CLONE_PARENT | CLONE_THREAD | CLONE_NEWNS |\
| > + CLONE_SYSVSEM | CLONE_SETTLS |\
| > + CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\
| > + CLONE_DETACHED | CLONE_UNTRACED |\
| > + CLONE_CHILD_SETTID | CLONE_STOPPED |\
| > + CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
| > + CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
| > +
| > /*
| > * Scheduling policies
| > */
| > Index: linux-2.6/kernel/fork.c
| > ===================================================================
| > --- linux-2.6.orig/kernel/fork.c 2009-10-02 19:00:08.000000000 -0700
| > +++ linux-2.6/kernel/fork.c 2009-10-02 19:57:36.000000000 -0700
| > @@ -942,6 +942,9 @@ static struct task_struct *copy_process(
| > struct task_struct *p;
| > int cgroup_callbacks_done = 0;
| >
|
| We can safely apply these tests to clone3(), because it is a new syscall.
|
| However, I don't know if applying it to clone() can break existing
| application that may already be (incorrectly) using invalid flags ?

Doing the check in copy_process() seems would apply to all architectures.

As for breaking an existing app, there is only one unused flag, 0x00001000
(bw CLONE_SIGHAND and CLONE_PTRACE). If an application could be using that
and we don't want to break it, maybe we can add following macro to the
VALID_CLONE_FLAGS list ?

#define CLONE_UNUSED_BIT 0x00001000

2009-10-14 00:01:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
>
> This patchset implements a new system call, clone3() that lets a process
> specify the pids of the child process.
>

A system call named clone3() taking two parameters is just too weird to
live. No, please.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-10-13 23:55:10

by Roland McGrath

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

> My only concern is the support of 64-bit clone flags on 32-bit architectures.

Oy. I didn't realize there was serious consideration of having more than
32 flags. IMHO it would be a bad choice, since they could only be used via
clone3. Having high-bit flags work in clone on 64-bit machines but not on
32-bit machines just seems like a wrongly confusing way for things to be.
If any high-bits flags are constrained even on 64-bit machines to uses in
clone3 calls for sanity purposes, then it seems questionable IMHO to have
them be more flags in the same u64 at all.

Since all new features will be via this struct, various new kinds of things
could potentially be done by other new struct fields independent of flags.
But that would of course require putting enough reserved fields in now and
requiring that they be zero-filled now in anticipation of such future uses,
which is not very pleasant either.

In short, I guess I really am saying that "clone_flags_high" (or
"more_flags" or something) does seem better to me than any of the
possibilities for having more than 32 CLONE_* in the current flags word.


Thanks,
Roland

2009-10-14 01:24:45

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On 10/13/2009 04:53 PM, Roland McGrath wrote:
>> My only concern is the support of 64-bit clone flags on 32-bit architectures.
>
> Oy. I didn't realize there was serious consideration of having more than
> 32 flags. IMHO it would be a bad choice, since they could only be used via
> clone3. Having high-bit flags work in clone on 64-bit machines but not on
> 32-bit machines just seems like a wrongly confusing way for things to be.
> If any high-bits flags are constrained even on 64-bit machines to uses in
> clone3 calls for sanity purposes, then it seems questionable IMHO to have
> them be more flags in the same u64 at all.
>
> Since all new features will be via this struct, various new kinds of things
> could potentially be done by other new struct fields independent of flags.
> But that would of course require putting enough reserved fields in now and
> requiring that they be zero-filled now in anticipation of such future uses,
> which is not very pleasant either.
>
> In short, I guess I really am saying that "clone_flags_high" (or
> "more_flags" or something) does seem better to me than any of the
> possibilities for having more than 32 CLONE_* in the current flags word.
>

Overall it seems sane to:

a) make it an actual 3-argument call;
b) make the existing flags a u32 forever, and make it a separate
argument;
c) any new expansion can be via the struct, which may want to have
an "c3_flags" field first in the structure.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-10-14 01:40:24

by Matt Helsley

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On Tue, Oct 13, 2009 at 04:49:05PM -0700, H. Peter Anvin wrote:
> On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
> >
> > This patchset implements a new system call, clone3() that lets a process
> > specify the pids of the child process.
> >
>
> A system call named clone3() taking two parameters is just too weird to
> live. No, please.

Except we can't use clone2() because it conflicts on ia64. Care to propose
a name you would prefer?

Also I was a bit suprised to discover there are plenty of examples where this
convention has not been followed: vm86, lseek64, and mmap2 to name a few. In
fact, of the 46 __NR_foo[[:digit:]]+, 36 break this convention on x86-32.

Cheers,
-Matt Helsley

2009-10-14 02:32:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On 10/13/2009 06:39 PM, Matt Helsley wrote:
> On Tue, Oct 13, 2009 at 04:49:05PM -0700, H. Peter Anvin wrote:
>> On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>
>> A system call named clone3() taking two parameters is just too weird to
>> live. No, please.
>
> Except we can't use clone2() because it conflicts on ia64. Care to propose
> a name you would prefer?
>
> Also I was a bit suprised to discover there are plenty of examples where this
> convention has not been followed: vm86, lseek64, and mmap2 to name a few. In
> fact, of the 46 __NR_foo[[:digit:]]+, 36 break this convention on x86-32.
>

The -86, -64 and so on are visually obviously not a parameter count.
sys_mmap2 is not user visible, and so doesn't really matter.

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-10-14 04:35:30

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

H. Peter Anvin [[email protected]] wrote:
| On 10/13/2009 04:53 PM, Roland McGrath wrote:
| >> My only concern is the support of 64-bit clone flags on 32-bit architectures.
| >
| > Oy. I didn't realize there was serious consideration of having more than
| > 32 flags. IMHO it would be a bad choice, since they could only be used via
| > clone3. Having high-bit flags work in clone on 64-bit machines but not on
| > 32-bit machines just seems like a wrongly confusing way for things to be.
| > If any high-bits flags are constrained even on 64-bit machines to uses in
| > clone3 calls for sanity purposes, then it seems questionable IMHO to have
| > them be more flags in the same u64 at all.
| >
| > Since all new features will be via this struct, various new kinds of things
| > could potentially be done by other new struct fields independent of flags.
| > But that would of course require putting enough reserved fields in now and
| > requiring that they be zero-filled now in anticipation of such future uses,
| > which is not very pleasant either.
| >
| > In short, I guess I really am saying that "clone_flags_high" (or
| > "more_flags" or something) does seem better to me than any of the
| > possibilities for having more than 32 CLONE_* in the current flags word.
| >
|
| Overall it seems sane to:
|
| a) make it an actual 3-argument call;
| b) make the existing flags a u32 forever, and make it a separate
| argument;
| c) any new expansion can be via the struct, which may want to have
| an "c3_flags" field first in the structure.

Wouldn't that make it a somewhat awkward interface - applications or
libc have to split the clone flags and specify them in two different
fields ? Do we need to preserve the u32 flags forever ?

The closest example I can think of is the distinction between signal()
and sigaction(). sigaction() interface makes a lot of sense in part bc
it does not try to look like signal(). And strace seems to be able
decode flags like SA_RESTART and SA_SIGINFO with the pointer to the
sigaction buffer.

Won't the debug tools be able to do the same with a pointer to
struct clone_args ?

If it needs the new features of clone3(), the application has to be
modified. At that point wouldn't it be easy to write a clone3() wrapper
that calls sys_clone3() if it exists or call sys_clone() if not ?

If clone3() supersedes clone(), all architectures could use the clone3()
interface and it would seem best to make clone3() as simple as possible
without having to preserve the u32 flags.

Would it help to use a type clone_flags_64_t to make the distinction
between types more explicit ?

Sukadev

2009-10-14 04:50:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On 10/13/2009 09:36 PM, Sukadev Bhattiprolu wrote:
>
> Would it help to use a type clone_flags_64_t to make the distinction
> between types more explicit ?
>

The problem with using the same flags in two places, one as a 32-bit and
one as a 64-bit number, is that using one in the wrong place will cause
silent, but deadly, truncation.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-10-14 04:40:00

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

H. Peter Anvin [[email protected]] wrote:
| >
| > Except we can't use clone2() because it conflicts on ia64. Care to propose
| > a name you would prefer?

Yes, I am running out of ideas :-)

How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

2009-10-14 04:58:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On 10/13/2009 09:40 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [[email protected]] wrote:
> | >
> | > Except we can't use clone2() because it conflicts on ia64. Care to propose
> | > a name you would prefer?
>
> Yes, I am running out of ideas :-)
>
> How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

never_clone_alone() ;)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-10-14 12:27:18

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall

On Tuesday 13 October 2009, Sukadev Bhattiprolu wrote:
> +clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
> +
> + In addition to doing everything that clone() system call does,
> + the clone3() system call:
> +
> + - allows additional clone flags (all 32 bits in the flags
> + parameter to clone() are in use)
> +
> + - allows user to specify a pid for the child process in its
> + active and ancestor pid name spaces.

Someone (sorry, can't find the old mail) pointed out last time that
the 'pid_t *__user tidptr' argument needs to be an independent pointer,
in order to allow the same use patterns with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID that you can do with the current clone
implementation.

Moving that argument from clone_struct into the argument list would
also make this a three-argument syscall, which solves the naming problem.

Arnd <><

2009-10-14 16:08:43

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Quoting Sukadev Bhattiprolu ([email protected]):
> H. Peter Anvin [[email protected]] wrote:
> | >
> | > Except we can't use clone2() because it conflicts on ia64. Care to propose
> | > a name you would prefer?
>
> Yes, I am running out of ideas :-)
>
> How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

Well, maybe accept that adding more flags will still require a new syscall,
and call this clone_with_pids(), and only take a single 32-bit flags field?

Or, go back to clone_extended().

-serge

2009-10-14 18:38:39

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall

Arnd Bergmann [[email protected]] wrote:
| On Tuesday 13 October 2009, Sukadev Bhattiprolu wrote:
| > +clone3(struct clone_struct * __user clone_args, pid_t * __user pids)
| > +
| > + In addition to doing everything that clone() system call does,
| > + the clone3() system call:
| > +
| > + - allows additional clone flags (all 32 bits in the flags
| > + parameter to clone() are in use)
| > +
| > + - allows user to specify a pid for the child process in its
| > + active and ancestor pid name spaces.
|
| Someone (sorry, can't find the old mail) pointed out last time that
| the 'pid_t *__user tidptr' argument needs to be an independent pointer,
| in order to allow the same use patterns with CLONE_CHILD_SETTID and
| CLONE_CHILD_CLEARTID that you can do with the current clone
| implementation.

I think it was Oren Laadan.

|
| Moving that argument from clone_struct into the argument list would
| also make this a three-argument syscall, which solves the naming problem.

How about we pull 'nr_pids' out of clone_args struct and make that
a separate parameter. It seems the odd man out and controls the pid_t
list parameter.
|
| Arnd <><

2009-10-14 22:36:07

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

H. Peter Anvin [[email protected]] wrote:
| On 10/13/2009 04:53 PM, Roland McGrath wrote:
| >> My only concern is the support of 64-bit clone flags on 32-bit architectures.
| >
| > Oy. I didn't realize there was serious consideration of having more than
| > 32 flags. IMHO it would be a bad choice, since they could only be used via
| > clone3. Having high-bit flags work in clone on 64-bit machines but not on
| > 32-bit machines just seems like a wrongly confusing way for things to be.
| > If any high-bits flags are constrained even on 64-bit machines to uses in
| > clone3 calls for sanity purposes, then it seems questionable IMHO to have
| > them be more flags in the same u64 at all.
| >
| > Since all new features will be via this struct, various new kinds of things
| > could potentially be done by other new struct fields independent of flags.
| > But that would of course require putting enough reserved fields in now and
| > requiring that they be zero-filled now in anticipation of such future uses,
| > which is not very pleasant either.
| >
| > In short, I guess I really am saying that "clone_flags_high" (or
| > "more_flags" or something) does seem better to me than any of the
| > possibilities for having more than 32 CLONE_* in the current flags word.
| >
|
| Overall it seems sane to:
|
| a) make it an actual 3-argument call;
| b) make the existing flags a u32 forever, and make it a separate
| argument;
| c) any new expansion can be via the struct, which may want to have
| an "c3_flags" field first in the structure.

Ok, So will this work ?

struct clone_args {
u32 flags_high; /* new clone flags (higher bits) */
u32 reserved1;
u32 nr_pids;
u32 reserved2;
u64 child_stack_base;
u64 child_stack_size;
u64 parent_tid_ptr;
u64 child_tid_ptr;
u64 reserved3;
};

sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)

Even on 64bit architectures the applications have to use sys_clone3() for
the extended features.

2009-10-14 23:03:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [[email protected]] wrote:
> |
> | Overall it seems sane to:
> |
> | a) make it an actual 3-argument call;
> | b) make the existing flags a u32 forever, and make it a separate
> | argument;
> | c) any new expansion can be via the struct, which may want to have
> | an "c3_flags" field first in the structure.
>
> Ok, So will this work ?
>
> struct clone_args {
> u32 flags_high; /* new clone flags (higher bits) */
> u32 reserved1;
> u32 nr_pids;
> u32 reserved2;
> u64 child_stack_base;
> u64 child_stack_size;
> u64 parent_tid_ptr;
> u64 child_tid_ptr;
> u64 reserved3;
> };
>
> sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
>
> Even on 64bit architectures the applications have to use sys_clone3() for
> the extended features.

Yes, although I'd just make flags_high a u64. The other thing that
might be worthwhile is to have a length field on the structure; that way
we could add new fields at the end if ever necessary in the future.

-hpa

2009-10-15 00:16:57

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

H. Peter Anvin [[email protected]] wrote:
| On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
| > H. Peter Anvin [[email protected]] wrote:
| > |
| > | Overall it seems sane to:
| > |
| > | a) make it an actual 3-argument call;
| > | b) make the existing flags a u32 forever, and make it a separate
| > | argument;
| > | c) any new expansion can be via the struct, which may want to have
| > | an "c3_flags" field first in the structure.
| >
| > Ok, So will this work ?
| >
| > struct clone_args {
| > u32 flags_high; /* new clone flags (higher bits) */
| > u32 reserved1;
| > u32 nr_pids;
| > u32 reserved2;
| > u64 child_stack_base;
| > u64 child_stack_size;
| > u64 parent_tid_ptr;
| > u64 child_tid_ptr;
| > u64 reserved3;
| > };
| >
| > sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
| >
| > Even on 64bit architectures the applications have to use sys_clone3() for
| > the extended features.
|
| Yes, although I'd just make flags_high a u64.

so we allow 96 bits for flags ?

| The other thing that might be worthwhile is to have a length field on
| the structure; that way we could add new fields at the end if ever
| necessary in the future.

So:
struct clone_args {
u64 flags_high; /* new clone flags (higher bits) */
u64 reserved1;

u32 nr_pids;
u32 clone_args_size;

u64 child_stack_base;
u64 child_stack_size;

u64 parent_tid_ptr;
u64 child_tid_ptr;
};

sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)

BTW, on 64-bit architectures, the flags_low would be 64-bits, but the high-
bits there would be ignored right ?

Not sure if we need a second reserved field now that we add ->clone_args_size.

Thanks,

Sukadev

2009-10-15 00:23:41

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()

Pavel Emelyanov [[email protected]] wrote:
| Sukadev Bhattiprolu wrote:
| >
| > Subject: [RFC][v8][PATCH 4/10]: Add target_pid parameter to alloc_pidmap()
| >
|
| [ snip ]
|
| > @@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
| > return 0;
| > }
| >
| > -static int alloc_pidmap(struct pid_namespace *pid_ns)
| > +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
|
| Please no! Better create another function, that will just
| atomic_test_and_set() the proper bit in the map. This one
| is heavy enough already.

I had a set_pidmap() function before:

http://lkml.org/lkml/2009/8/7/24

but dropped it based on some comments and the current version evolved
from it. I could go back to that if it still makes sense.

Thanks,

Sukadev

2009-10-16 04:19:49

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Here is an updated patch with the following interface:

long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
pid_t *pids);

There are just two other (minor) changes pending to this patchset:

- PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
- PATCH 10: update documentation to reflect new interface.

If this looks ok, we repost entire patchset next week.
---

Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone3(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all ?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone3()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/clone3 in next patch for more details of clone3() and an
example of its usage.

NOTE:
- System calls are restricted to 6 parameters and the number and sizes
of parameters needed for sys_clone3() exceed 6 integers. The new
prototype works around this restriction while providing some
flexibility if clone3() needs to be further extended in the future.
TODO:
- We should convert clone-flags to 64-bit value in all architectures.
Its probably best to do that as a separate patchset since clone_flags
touches several functions and that patchset seems independent of this
new system call.


Changelog[v9-rc1]:
- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
architectures split the new clone-flags into 'low' and 'high'
words and pass in the 'lower' flags as the first argument.
This would maintain similarity of the clone3() with clone()/
clone2(). Also has the side-effect of the name matching the
number of parameters :-)
- [Roland McGrath] Rename structure to 'clone_args' and add a
'child_stack_size' field

Changelog[v8]
- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
must be 64-bit.
- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
is constant across architectures.
- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
'unum_pids < 0' check.

Changelog[v4]:
- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
in the target_pids[] list and setting it 0. See copy_target_pids()).
- (Oren Laadan) Specified target pids should apply only to youngest
pid-namespaces (see copy_target_pids())
- (Matt Helsley) Update patch description.

Changelog[v2]:
- Remove unnecessary printk and add a note to callers of
copy_target_pids() to free target_pids.
- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
'num_pids == 0' (fall back to normal clone()).
- Move arch-independent code (sanity checks and copy-in of target-pids)
into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my
git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <[email protected]>

---
arch/x86/include/asm/syscalls.h | 2
arch/x86/include/asm/unistd_32.h | 1
arch/x86/kernel/process_32.c | 61 +++++++++++++++++++++++
arch/x86/kernel/syscall_table_32.S | 1
include/linux/types.h | 11 ++++
kernel/fork.c | 96 ++++++++++++++++++++++++++++++++++++-
6 files changed, 171 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h 2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/include/linux/types.h 2009-10-15 20:53:22.000000000 -0700
@@ -204,6 +204,17 @@ struct ustat {
char f_fpack[6];
};

+struct clone_args {
+ u64 clone_flags_high;
+ u64 child_stack_base;
+ u64 child_stack_size;
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+ u32 nr_pids;
+ u32 clone_args_size;
+ u64 reserved1;
+};
+
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
#endif /* _LINUX_TYPES_H */
Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h 2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h 2009-10-15 20:38:53.000000000 -0700
@@ -55,6 +55,8 @@ struct sel_arg_struct;
struct oldold_utsname;
struct old_utsname;

+asmlinkage long sys_clone3(unsigned int flags_low,
+ struct clone_args __user *cs, pid_t *pids);
asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
unsigned long, unsigned long, unsigned long);
asmlinkage int old_mmap(struct mmap_arg_struct __user *);
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h 2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h 2009-10-15 20:38:53.000000000 -0700
@@ -342,6 +342,7 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_counter_open 336
+#define __NR_clone3 337

#ifdef __KERNEL__

Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c 2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c 2009-10-15 20:38:53.000000000 -0700
@@ -443,6 +443,67 @@ int sys_clone(struct pt_regs *regs)
return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
}

+asmlinkage long
+sys_clone3(unsigned int flags_low, struct clone_args __user *ucs,
+ pid_t __user *pids)
+{
+ int rc;
+ struct clone_args kcs;
+ unsigned long flags;
+ int __user *parent_tid_ptr;
+ int __user *child_tid_ptr;
+ unsigned long __user child_stack;
+ unsigned long stack_size;
+ struct pt_regs *regs;
+
+ rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+ if (rc)
+ return -EFAULT;
+
+ /*
+ * TODO: If size of clone_args is not what the kernel expects, it
+ * could be that kernel is newer and has an extended structure.
+ * When that happens, this check needs to be smarter (and we
+ * need an additional copy_from_user()). For now, assume exact
+ * match.
+ */
+ if (kcs.clone_args_size != sizeof(kcs))
+ return -EINVAL;
+
+ /*
+ * To avoid future compatibility issues, ensure unused fields are 0.
+ */
+ if (kcs.reserved1 || kcs.clone_flags_high)
+ return -EINVAL;
+
+ /*
+ * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+ * TODO: When ->clone_flags_high is non-zero, copy it in to the
+ * higher word(s) of 'flags':
+ *
+ * flags = (kcs.clone_flags_high << 32) | flags_low;
+ */
+ flags = flags_low;
+ parent_tid_ptr = (int *)kcs.parent_tid_ptr;
+ child_tid_ptr = (int *)kcs.child_tid_ptr;
+
+ stack_size = (unsigned long)kcs.child_stack_size;
+ child_stack = (unsigned long)kcs.child_stack_base + stack_size;
+
+ regs = task_pt_regs(current);
+
+ if (!child_stack)
+ child_stack = user_stack_pointer(regs);
+
+ /*
+ * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+ * to several functions. Need to convert clone_flags to 64-bit.
+ */
+ return do_fork_with_pids(flags, child_stack, regs, stack_size,
+ parent_tid_ptr, child_tid_ptr, kcs.nr_pids,
+ pids);
+}
+
/*
* sys_execve() executes a new program.
*/
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S 2009-10-15 20:29:47.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S 2009-10-15 20:38:53.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_counter_open
+ .long sys_clone3
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2009-10-15 20:38:50.000000000 -0700
+++ linux-2.6/kernel/fork.c 2009-10-15 20:38:53.000000000 -0700
@@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
}

/*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+ int j;
+ int rc;
+ int size;
+ int knum_pids; /* # of pids needed in kernel */
+ pid_t *target_pids;
+
+ if (!unum_pids)
+ return NULL;
+
+ knum_pids = task_pid(current)->level + 1;
+ if (unum_pids > knum_pids)
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+ * and set it to 0. This last entry in target_pids[] corresponds to the
+ * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+ * specified. If CLONE_NEWPID was not specified, this last entry will
+ * simply be ignored.
+ */
+ target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+ if (!target_pids)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * A process running in a level 2 pid namespace has three pid namespaces
+ * and hence three pid numbers. If this process is checkpointed,
+ * information about these three namespaces are saved. We refer to these
+ * namespaces as 'known namespaces'.
+ *
+ * If this checkpointed process is however restarted in a level 3 pid
+ * namespace, the restarted process has an extra ancestor pid namespace
+ * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+ *
+ * During restart, the process requests specific pids for its 'known
+ * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+ *
+ * Since the requested-pids correspond to 'known namespaces' and since
+ * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+ * namespaces', copy requested pids to the back-end of target_pids[]
+ * (i.e before the last entry for CLONE_NEWPID mentioned above).
+ * Any entries in target_pids[] not corresponding to a requested pid
+ * will be set to zero and kernel assigns a pid in those namespaces.
+ *
+ * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+ * youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+ * the order is:
+ *
+ * - pids for 'unknown-namespaces' (if any)
+ * - pids for 'known-namespaces' (requested pids)
+ * - 0 in the last entry (for CLONE_NEWPID).
+ */
+ j = knum_pids - unum_pids;
+ size = unum_pids * sizeof(pid_t);
+
+ rc = copy_from_user(&target_pids[j], upids, size);
+ if (rc) {
+ rc = -EFAULT;
+ goto out_free;
+ }
+
+ return target_pids;
+
+out_free:
+ kfree(target_pids);
+ return ERR_PTR(rc);
+}
+
+/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
@@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
struct task_struct *p;
int trace = 0;
long nr;
- pid_t *target_pids = NULL;
+ pid_t *target_pids;

/*
* Do some preliminary argument and permissions checking before we
@@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
}
}

+ target_pids = copy_target_pids(num_pids, upids);
+ if (target_pids) {
+ if (IS_ERR(target_pids))
+ return PTR_ERR(target_pids);
+
+ nr = -EPERM;
+ if (!capable(CAP_SYS_ADMIN))
+ goto out_free;
+ }
+
/*
* When called from kernel_thread, don't do user tracing stuff.
*/
@@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
} else {
nr = PTR_ERR(p);
}
+
+out_free:
+ kfree(target_pids);
+
return nr;
}

2009-10-16 06:27:13

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Hi Sukadev

On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
<[email protected]> wrote:
> Here is an updated patch with the following interface:
>
> ? ? ? ?long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
> ? ? ? ? ? ? ? ? ? ? ? ?pid_t *pids);
>
> There are just two other (minor) changes pending to this patchset:
>
> ? ? ? ?- PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
> ? ? ? ?- PATCH 10: update documentation to reflect new interface.
>
> If this looks ok, we repost entire patchset next week.

I know I'm late to this discussion, but why the name clone3()? It's
not consistent with any other convention used fo syscall naming,
AFAICS. I think a name like clone_ext() or clonex() (for extended)
might make more sense.

Cheers,

Michael


> ---
>
> Subject: [RFC][v8][PATCH 9/10]: Define clone3() syscall
>
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
>
> clone3(), intended for use during restart, is the same as clone(), except
> that it takes a 'pids' paramter. This parameter lets caller choose
> specific pid numbers for the child process, in the process's active and
> ancestor pid namespaces. (Descendant pid namespaces in general don't matter
> since processes don't have pids in them anyway, but see comments in
> copy_target_pids() regarding CLONE_NEWPID).
>
> Clone2() system call also attempts to address a second limitation of the
> clone() system call. clone() is restricted to 32 clone flags and most (all ?)
> of these are in use. If a new clone flag is needed, we will be forced to
> define a new variant of the clone() system call.
>
> To prevent unprivileged processes from misusing this interface, clone3()
> currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.
>
> See Documentation/clone3 in next patch for more details of clone3() and an
> example of its usage.
>
> NOTE:
> ? ? ? ?- System calls are restricted to 6 parameters and the number and sizes
> ? ? ? ? ?of parameters needed for sys_clone3() exceed 6 integers. The new
> ? ? ? ? ?prototype works around this restriction while providing some
> ? ? ? ? ?flexibility if clone3() needs to be further extended in the future.
> TODO:
> ? ? ? ?- We should convert clone-flags to 64-bit value in all architectures.
> ? ? ? ? ?Its probably best to do that as a separate patchset since clone_flags
> ? ? ? ? ?touches several functions and that patchset seems independent of this
> ? ? ? ? ?new system call.
>
>
> Changelog[v9-rc1]:
> ? ? ? ?- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
> ? ? ? ? ?architectures split the new clone-flags into 'low' and 'high'
> ? ? ? ? ?words and pass in the 'lower' flags as the first argument.
> ? ? ? ? ?This would maintain similarity of the clone3() with clone()/
> ? ? ? ? ?clone2(). Also has the side-effect of the name matching the
> ? ? ? ? ?number of parameters :-)
> ? ? ? ?- [Roland McGrath] Rename structure to 'clone_args' and add a
> ? ? ? ? ?'child_stack_size' field
>
> Changelog[v8]
> ? ? ? ?- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
> ? ? ? ? ?must be 64-bit.
> ? ? ? ?- clone2() is in use in IA64. Rename system call to clone3().
>
> Changelog[v7]:
> ? ? ? ?- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
> ? ? ? ? ?and group parameters into a new 'struct clone_struct' object.
>
> Changelog[v6]:
> ? ? ? ?- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
> ? ? ? ? ?Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
> ? ? ? ? ?is constant across architectures.
> ? ? ? ?- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
> ? ? ? ? ?'unum_pids < 0' check.
>
> Changelog[v4]:
> ? ? ? ?- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
>
> Changelog[v3]:
> ? ? ? ?- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
> ? ? ? ? ?in the target_pids[] list and setting it 0. See copy_target_pids()).
> ? ? ? ?- (Oren Laadan) Specified target pids should apply only to youngest
> ? ? ? ? ?pid-namespaces (see copy_target_pids())
> ? ? ? ?- (Matt Helsley) Update patch description.
>
> Changelog[v2]:
> ? ? ? ?- Remove unnecessary printk and add a note to callers of
> ? ? ? ? ?copy_target_pids() to free target_pids.
> ? ? ? ?- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
> ? ? ? ?- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
> ? ? ? ? ?'num_pids == 0' (fall back to normal clone()).
> ? ? ? ?- Move arch-independent code (sanity checks and copy-in of target-pids)
> ? ? ? ? ?into kernel/fork.c and simplify sys_clone_with_pids()
>
> Changelog[v1]:
> ? ? ? ?- Fixed some compile errors (had fixed these errors earlier in my
> ? ? ? ? ?git tree but had not refreshed patches before emailing them)
>
> Signed-off-by: Sukadev Bhattiprolu <[email protected]>
>
> ---
> ?arch/x86/include/asm/syscalls.h ? ?| ? ?2
> ?arch/x86/include/asm/unistd_32.h ? | ? ?1
> ?arch/x86/kernel/process_32.c ? ? ? | ? 61 +++++++++++++++++++++++
> ?arch/x86/kernel/syscall_table_32.S | ? ?1
> ?include/linux/types.h ? ? ? ? ? ? ?| ? 11 ++++
> ?kernel/fork.c ? ? ? ? ? ? ? ? ? ? ?| ? 96 ++++++++++++++++++++++++++++++++++++-
> ?6 files changed, 171 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/include/linux/types.h
> ===================================================================
> --- linux-2.6.orig/include/linux/types.h ? ? ? ?2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/include/linux/types.h ? ? 2009-10-15 20:53:22.000000000 -0700
> @@ -204,6 +204,17 @@ struct ustat {
> ? ? ? ?char ? ? ? ? ? ? ? ? ? ?f_fpack[6];
> ?};
>
> +struct clone_args {
> + ? ? ? u64 clone_flags_high;
> + ? ? ? u64 child_stack_base;
> + ? ? ? u64 child_stack_size;
> + ? ? ? u64 parent_tid_ptr;
> + ? ? ? u64 child_tid_ptr;
> + ? ? ? u32 nr_pids;
> + ? ? ? u32 clone_args_size;
> + ? ? ? u64 reserved1;
> +};
> +
> ?#endif /* __KERNEL__ */
> ?#endif /* ?__ASSEMBLY__ */
> ?#endif /* _LINUX_TYPES_H */
> Index: linux-2.6/arch/x86/include/asm/syscalls.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/syscalls.h ? ? ?2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/syscalls.h ? 2009-10-15 20:38:53.000000000 -0700
> @@ -55,6 +55,8 @@ struct sel_arg_struct;
> ?struct oldold_utsname;
> ?struct old_utsname;
>
> +asmlinkage long sys_clone3(unsigned int flags_low,
> + ? ? ? ? ? ? ? ? ? ? ? struct clone_args __user *cs, pid_t *pids);
> ?asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
> ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long, unsigned long, unsigned long);
> ?asmlinkage int old_mmap(struct mmap_arg_struct __user *);
> Index: linux-2.6/arch/x86/include/asm/unistd_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/unistd_32.h ? ? 2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/unistd_32.h ?2009-10-15 20:38:53.000000000 -0700
> @@ -342,6 +342,7 @@
> ?#define __NR_pwritev ? ? ? ? ? 334
> ?#define __NR_rt_tgsigqueueinfo 335
> ?#define __NR_perf_counter_open 336
> +#define __NR_clone3 ? ? ? ? ? ?337
>
> ?#ifdef __KERNEL__
>
> Index: linux-2.6/arch/x86/kernel/process_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/process_32.c 2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/process_32.c ? ? ?2009-10-15 20:38:53.000000000 -0700
> @@ -443,6 +443,67 @@ int sys_clone(struct pt_regs *regs)
> ? ? ? ?return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
> ?}
>
> +asmlinkage long
> +sys_clone3(unsigned int flags_low, struct clone_args __user *ucs,
> + ? ? ? ? ? ? ? pid_t __user *pids)
> +{
> + ? ? ? int rc;
> + ? ? ? struct clone_args kcs;
> + ? ? ? unsigned long flags;
> + ? ? ? int __user *parent_tid_ptr;
> + ? ? ? int __user *child_tid_ptr;
> + ? ? ? unsigned long __user child_stack;
> + ? ? ? unsigned long stack_size;
> + ? ? ? struct pt_regs *regs;
> +
> + ? ? ? rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> + ? ? ? if (rc)
> + ? ? ? ? ? ? ? return -EFAULT;
> +
> + ? ? ? /*
> + ? ? ? ?* TODO: If size of clone_args is not what the kernel expects, it
> + ? ? ? ?* ? ? ? could be that kernel is newer and has an extended structure.
> + ? ? ? ?* ? ? ? When that happens, this check needs to be smarter (and we
> + ? ? ? ?* ? ? ? need an additional copy_from_user()). For now, assume exact
> + ? ? ? ?* ? ? ? match.
> + ? ? ? ?*/
> + ? ? ? if (kcs.clone_args_size != sizeof(kcs))
> + ? ? ? ? ? ? ? return -EINVAL;
> +
> + ? ? ? /*
> + ? ? ? ?* To avoid future compatibility issues, ensure unused fields are 0.
> + ? ? ? ?*/
> + ? ? ? if (kcs.reserved1 || kcs.clone_flags_high)
> + ? ? ? ? ? ? ? return -EINVAL;
> +
> + ? ? ? /*
> + ? ? ? ?* TODO: Convert 'clone-flags' to 64-bits on all architectures.
> + ? ? ? ?* TODO: When ->clone_flags_high is non-zero, copy it in to the
> + ? ? ? ?* ? ? ? higher word(s) of 'flags':
> + ? ? ? ?*
> + ? ? ? ?* ? ? ? ? ? ? ?flags = (kcs.clone_flags_high << 32) | flags_low;
> + ? ? ? ?*/
> + ? ? ? flags = flags_low;
> + ? ? ? parent_tid_ptr = (int *)kcs.parent_tid_ptr;
> + ? ? ? child_tid_ptr = ?(int *)kcs.child_tid_ptr;
> +
> + ? ? ? stack_size = (unsigned long)kcs.child_stack_size;
> + ? ? ? child_stack = (unsigned long)kcs.child_stack_base + stack_size;
> +
> + ? ? ? regs = task_pt_regs(current);
> +
> + ? ? ? if (!child_stack)
> + ? ? ? ? ? ? ? child_stack = user_stack_pointer(regs);
> +
> + ? ? ? /*
> + ? ? ? ?* TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
> + ? ? ? ?* ? ? ? to several functions. Need to convert clone_flags to 64-bit.
> + ? ? ? ?*/
> + ? ? ? return do_fork_with_pids(flags, child_stack, regs, stack_size,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? parent_tid_ptr, child_tid_ptr, kcs.nr_pids,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? pids);
> +}
> +
> ?/*
> ?* sys_execve() executes a new program.
> ?*/
> Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S ? 2009-10-15 20:29:47.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/syscall_table_32.S ? ? ? ?2009-10-15 20:38:53.000000000 -0700
> @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
> ? ? ? ?.long sys_pwritev
> ? ? ? ?.long sys_rt_tgsigqueueinfo ? ? /* 335 */
> ? ? ? ?.long sys_perf_counter_open
> + ? ? ? .long sys_clone3
> Index: linux-2.6/kernel/fork.c
> ===================================================================
> --- linux-2.6.orig/kernel/fork.c ? ? ? ?2009-10-15 20:38:50.000000000 -0700
> +++ linux-2.6/kernel/fork.c ? ? 2009-10-15 20:38:53.000000000 -0700
> @@ -1330,6 +1330,86 @@ struct task_struct * __cpuinit fork_idle
> ?}
>
> ?/*
> + * If user specified any 'target-pids' in @upid_setp, copy them from
> + * user and return a pointer to a local copy of the list of pids. The
> + * caller must free the list, when they are done using it.
> + *
> + * If user did not specify any target pids, return NULL (caller should
> + * treat this like normal clone).
> + *
> + * On any errors, return the error code
> + */
> +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
> +{
> + ? ? ? int j;
> + ? ? ? int rc;
> + ? ? ? int size;
> + ? ? ? int knum_pids; ? ? ? ? ?/* # of pids needed in kernel */
> + ? ? ? pid_t *target_pids;
> +
> + ? ? ? if (!unum_pids)
> + ? ? ? ? ? ? ? return NULL;
> +
> + ? ? ? knum_pids = task_pid(current)->level + 1;
> + ? ? ? if (unum_pids > knum_pids)
> + ? ? ? ? ? ? ? return ERR_PTR(-EINVAL);
> +
> + ? ? ? /*
> + ? ? ? ?* To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
> + ? ? ? ?* and set it to 0. This last entry in target_pids[] corresponds to the
> + ? ? ? ?* (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
> + ? ? ? ?* specified. If CLONE_NEWPID was not specified, this last entry will
> + ? ? ? ?* simply be ignored.
> + ? ? ? ?*/
> + ? ? ? target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
> + ? ? ? if (!target_pids)
> + ? ? ? ? ? ? ? return ERR_PTR(-ENOMEM);
> +
> + ? ? ? /*
> + ? ? ? ?* A process running in a level 2 pid namespace has three pid namespaces
> + ? ? ? ?* and hence three pid numbers. If this process is checkpointed,
> + ? ? ? ?* information about these three namespaces are saved. We refer to these
> + ? ? ? ?* namespaces as 'known namespaces'.
> + ? ? ? ?*
> + ? ? ? ?* If this checkpointed process is however restarted in a level 3 pid
> + ? ? ? ?* namespace, the restarted process has an extra ancestor pid namespace
> + ? ? ? ?* (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
> + ? ? ? ?*
> + ? ? ? ?* During restart, the process requests specific pids for its 'known
> + ? ? ? ?* namespaces' and lets kernel assign pids to its 'unknown namespaces'.
> + ? ? ? ?*
> + ? ? ? ?* Since the requested-pids correspond to 'known namespaces' and since
> + ? ? ? ?* 'known-namespaces' are younger than (i.e descendants of) 'unknown-
> + ? ? ? ?* namespaces', copy requested pids to the back-end of target_pids[]
> + ? ? ? ?* (i.e before the last entry for CLONE_NEWPID mentioned above).
> + ? ? ? ?* Any entries in target_pids[] not corresponding to a requested pid
> + ? ? ? ?* will be set to zero and kernel assigns a pid in those namespaces.
> + ? ? ? ?*
> + ? ? ? ?* NOTE: The order of pids in target_pids[] is oldest pid namespace to
> + ? ? ? ?* ? ? ? youngest (target_pids[0] corresponds to init_pid_ns). i.e.
> + ? ? ? ?* ? ? ? the order is:
> + ? ? ? ?*
> + ? ? ? ?* ? ? ? ? ? ? ?- pids for 'unknown-namespaces' (if any)
> + ? ? ? ?* ? ? ? ? ? ? ?- pids for 'known-namespaces' (requested pids)
> + ? ? ? ?* ? ? ? ? ? ? ?- 0 in the last entry (for CLONE_NEWPID).
> + ? ? ? ?*/
> + ? ? ? j = knum_pids - unum_pids;
> + ? ? ? size = unum_pids * sizeof(pid_t);
> +
> + ? ? ? rc = copy_from_user(&target_pids[j], upids, size);
> + ? ? ? if (rc) {
> + ? ? ? ? ? ? ? rc = -EFAULT;
> + ? ? ? ? ? ? ? goto out_free;
> + ? ? ? }
> +
> + ? ? ? return target_pids;
> +
> +out_free:
> + ? ? ? kfree(target_pids);
> + ? ? ? return ERR_PTR(rc);
> +}
> +
> +/*
> ?* ?Ok, this is the main fork-routine.
> ?*
> ?* It copies the process, and if successful kick-starts
> @@ -1347,7 +1427,7 @@ long do_fork_with_pids(unsigned long clo
> ? ? ? ?struct task_struct *p;
> ? ? ? ?int trace = 0;
> ? ? ? ?long nr;
> - ? ? ? pid_t *target_pids = NULL;
> + ? ? ? pid_t *target_pids;
>
> ? ? ? ?/*
> ? ? ? ? * Do some preliminary argument and permissions checking before we
> @@ -1381,6 +1461,16 @@ long do_fork_with_pids(unsigned long clo
> ? ? ? ? ? ? ? ?}
> ? ? ? ?}
>
> + ? ? ? target_pids = copy_target_pids(num_pids, upids);
> + ? ? ? if (target_pids) {
> + ? ? ? ? ? ? ? if (IS_ERR(target_pids))
> + ? ? ? ? ? ? ? ? ? ? ? return PTR_ERR(target_pids);
> +
> + ? ? ? ? ? ? ? nr = -EPERM;
> + ? ? ? ? ? ? ? if (!capable(CAP_SYS_ADMIN))
> + ? ? ? ? ? ? ? ? ? ? ? goto out_free;
> + ? ? ? }
> +
> ? ? ? ?/*
> ? ? ? ? * When called from kernel_thread, don't do user tracing stuff.
> ? ? ? ? */
> @@ -1442,6 +1532,10 @@ long do_fork_with_pids(unsigned long clo
> ? ? ? ?} else {
> ? ? ? ? ? ? ? ?nr = PTR_ERR(p);
> ? ? ? ?}
> +
> +out_free:
> + ? ? ? kfree(target_pids);
> +
> ? ? ? ?return nr;
> ?}
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2009-10-16 18:05:04

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Michael Kerrisk [[email protected]] wrote:
| Hi Sukadev
|
| On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
| <[email protected]> wrote:
| > Here is an updated patch with the following interface:
| >
| > ? ? ? ?long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
| > ? ? ? ? ? ? ? ? ? ? ? ?pid_t *pids);
| >
| > There are just two other (minor) changes pending to this patchset:
| >
| > ? ? ? ?- PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
| > ? ? ? ?- PATCH 10: update documentation to reflect new interface.
| >
| > If this looks ok, we repost entire patchset next week.
|
| I know I'm late to this discussion, but why the name clone3()? It's
| not consistent with any other convention used fo syscall naming,
| AFAICS. I think a name like clone_ext() or clonex() (for extended)
| might make more sense.

Sure, we talked about calling it clone_extended() and I can go back
to that.

Only minor concern with that name was if this new call ever needs to
be extended, what would we call it :-). With clone3() we could add a
real/fake parameter and call it clone4() :-p

Sukadev

2009-10-16 19:22:20

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu wrote:
> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>
> To support application checkpoint/restart, a task must have the same pid it
> had when it was checkpointed. When containers are nested, the tasks within
> the containers exist in multiple pid namespaces and hence have multiple pids
> to specify during restart.
>
> This patchset implements a new system call, clone3() that lets a process
> specify the pids of the child process.
>
> Patches 1 through 7 are helper patches, needed for choosing a pid for the
> child process.
>
> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
> documentation on the new system call, some/all of which will eventually
> go into a man page.
>

Sorry for jumping so late in the discussion and for having maybe my
remarks pointless...

If this syscall is only for checkpoint / restart, why this shouldn't be
used with a future generic sys_restart syscall ?
Otherwise, shouldn't be more convenient to have something usable for
everyone, let's say:

cloneat(pid_t pid, pid_t desiredpid, ...);

Where 'desiredpid' is a hint of for the kernel for the pid to be
allocated (zero means the kernel will choose one for us) and the newly
allocated task is the son of 'pid'.
That looks more consistent with the "<syscall>at" family, 'openat',
'faccessat', 'readlinkat', etc ... and usable for something else than
the checkpoint / restart.

Thanks
-- Daniel

2009-10-16 19:43:22

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Daniel Lezcano [[email protected]] wrote:
> Sukadev Bhattiprolu wrote:
>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>
>> To support application checkpoint/restart, a task must have the same pid it
>> had when it was checkpointed. When containers are nested, the tasks within
>> the containers exist in multiple pid namespaces and hence have multiple pids
>> to specify during restart.
>>
>> This patchset implements a new system call, clone3() that lets a process
>> specify the pids of the child process.
>>
>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>> child process.
>>
>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>> documentation on the new system call, some/all of which will eventually
>> go into a man page.
>>
>
> Sorry for jumping so late in the discussion and for having maybe my
> remarks pointless...
>
> If this syscall is only for checkpoint / restart, why this shouldn't be
> used with a future generic sys_restart syscall ?

As I tried to explain in PATCH 0/9, the ability to choose a pid is only
for C/R but we are also trying to clone-flags so we won't need yet
another variant of clone() fairly soon.

> Otherwise, shouldn't be more convenient to have something usable for
> everyone, let's say:
>
> cloneat(pid_t pid, pid_t desiredpid, ...);
>
> Where 'desiredpid' is a hint of for the kernel for the pid to be
> allocated (zero means the kernel will choose one for us) and the newly
> allocated task is the son of 'pid'.

Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
P2 ? I did not know we had a requirement for that. Can you explain the
use-case more ? IOW, why can't P2 create the child P3 by itself ?

Note also that 'desiredpid' must be a list of pids (one for each pid
namespaces that the child will belong to) and hence we need 'nr_pids'
to specify the list. Given that we are limited to 6 parameters to the
syscall, such parameters must be stuffed into 'struct clone_args'.

So we should do something like:

sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
pid_t *desired_pids)

or (to match the name and parameters, move 'pid' parameter into clone_args)

> That looks more consistent with the "<syscall>at" family, 'openat',
> 'faccessat', 'readlinkat', etc ... and usable for something else than
> the checkpoint / restart.

The subtle difference though is that openat() does not open a file on
behalf of another process and so the 'at' suffix would not apply ?

>
> Thanks
> -- Daniel

2009-10-19 17:44:14

by Matt Helsley

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On Fri, Oct 16, 2009 at 11:06:31AM -0700, Sukadev Bhattiprolu wrote:
> Michael Kerrisk [[email protected]] wrote:
> | Hi Sukadev
> |
> | On Fri, Oct 16, 2009 at 6:20 AM, Sukadev Bhattiprolu
> | <[email protected]> wrote:
> | > Here is an updated patch with the following interface:
> | >
> | > ? ? ? ?long sys_clone3(unsigned int flags_low, struct clone_args __user *cs,
> | > ? ? ? ? ? ? ? ? ? ? ? ?pid_t *pids);
> | >
> | > There are just two other (minor) changes pending to this patchset:
> | >
> | > ? ? ? ?- PATCH 7: add a CLONE_UNUSED bit to VALID_CLONE_FLAGS().
> | > ? ? ? ?- PATCH 10: update documentation to reflect new interface.
> | >
> | > If this looks ok, we repost entire patchset next week.
> |
> | I know I'm late to this discussion, but why the name clone3()? It's
> | not consistent with any other convention used fo syscall naming,
> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
> | might make more sense.
>
> Sure, we talked about calling it clone_extended() and I can go back
> to that.
>
> Only minor concern with that name was if this new call ever needs to
> be extended, what would we call it :-). With clone3() we could add a
> real/fake parameter and call it clone4() :-p

Perhaps clone64 (somewhat like stat64 for example)?

Cheers,
-Matt Helsley

2009-10-19 20:34:56

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu wrote:
> Daniel Lezcano [[email protected]] wrote:
>
>> Sukadev Bhattiprolu wrote:
>>
>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>
>>> To support application checkpoint/restart, a task must have the same pid it
>>> had when it was checkpointed. When containers are nested, the tasks within
>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>> to specify during restart.
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>> Patches 1 through 7 are helper patches, needed for choosing a pid for the
>>> child process.
>>>
>>> PATCH 9 defines a prototype of the new system call. PATCH 10 adds some
>>> documentation on the new system call, some/all of which will eventually
>>> go into a man page.
>>>
>>>
>> Sorry for jumping so late in the discussion and for having maybe my
>> remarks pointless...
>>
>> If this syscall is only for checkpoint / restart, why this shouldn't be
>> used with a future generic sys_restart syscall ?
>>
>
> As I tried to explain in PATCH 0/9, the ability to choose a pid is only
> for C/R but we are also trying to clone-flags so we won't need yet
> another variant of clone() fairly soon.
>
>
>> Otherwise, shouldn't be more convenient to have something usable for
>> everyone, let's say:
>>
>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>
>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>> allocated (zero means the kernel will choose one for us) and the newly
>> allocated task is the son of 'pid'.
>>
>
> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
> P2 ? I did not know we had a requirement for that. Can you explain the
> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>
I forgot to mention a constraint with the specified pid : P2 has to be
child of P1.
In other word, you can not specify a pid to clonat which is not your
descendant (including yourself).
With this constraint I think there is no security issues.

Concerning of forking on behalf of another process, we can consider it
is up to the caller / programmer to know what it does. If a process in
the process hierarchy exec'ed a program and we cloneat this process and
then the program fails because of an "unexpected error", well, we should
have not done that. A similar example is when the IPC are removed while
they are used by some other processes.

Here it is a interesting use case:
* if you created a pid namespace, and, let's say, booted a system
container where the container init is the "init" process, then with this
call you can enter the container at any time by doing cloneat() followed
by an exec of your command. I think that was a requirement when there
were discussions around "sys_hijack".

Another point. It's another way to extend the exhausted clone flags as
the cloneat can be called as a compatibility way, with cloneat(getpid(),
0, ... )

> Note also that 'desiredpid' must be a list of pids (one for each pid
> namespaces that the child will belong to) and hence we need 'nr_pids'
> to specify the list. Given that we are limited to 6 parameters to the
> syscall, such parameters must be stuffed into 'struct clone_args'.
>
> So we should do something like:
>
> sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
> pid_t *desired_pids)
>
> or (to match the name and parameters, move 'pid' parameter into clone_args)
>
Well, hiding multiple clone in one clone call is ... weird. AFAIR, there
was a debate between kernel or userspace proctree creation but it looks
like it's done from the kernel with this call.

I don't really see a difference between sys_restart(pid_t pid , int fd,
long flags) where pid_t is the topmost in the hierarchy, fd is a file
descriptor to a structure "pid_t * + struct clone_args *" and flags is
"PROCTREE".

IMHO, it is nicer to recursively restore the process tree for the nested
pid namespaces, that will be really an userspace process tree creation
and cloneat will be your friend here :)

>> That looks more consistent with the "<syscall>at" family, 'openat',
>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>> the checkpoint / restart.
>>
>
> The subtle difference though is that openat() does not open a file on
> behalf of another process and so the 'at' suffix would not apply ?
>
Yes and no, depending of where you put the cursor. If you consider the
'at' suffix means a process context, then I agree with you, there is a
difference because the cloneat will be out of the current process
context. But if you consider the 'at' suffix as a context in general,
and openat means "relatively to a file descriptor" and cloneat means
"relatively to a pid namespace" the 'at' suffix may apply. But I agree
that we are so used to call the posix "fork", that cloneat sounds scary :)

Thanks
-- Daniel

2009-10-19 21:39:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On 10/20/2009 02:44 AM, Matt Helsley wrote:
>> |
>> | I know I'm late to this discussion, but why the name clone3()? It's
>> | not consistent with any other convention used fo syscall naming,

This assumption, of course, is just plain wrong. Look at the wait
system calls, for example. However, when a small integer is used like
that, it pretty much always reflects numbers of arguments.

>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>> | might make more sense.
>>
>> Sure, we talked about calling it clone_extended() and I can go back
>> to that.
>>
>> Only minor concern with that name was if this new call ever needs to
>> be extended, what would we call it :-). With clone3() we could add a
>> real/fake parameter and call it clone4() :-p
>
> Perhaps clone64 (somewhat like stat64 for example)?
>

I think that doesn't exactly reflect the nature of the changes.

clone3() is actually pretty good.

-hpa

2009-10-21 07:55:01

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall

Hi!

> This gives a brief overview of the clone3() system call. We should

Thanks!

> eventually describe more details in existing clone(2) man page or in
> a new man page.

M. Kerrisk (see MAINTAINERS) maintains man pages...

> Changelog[v8]:
> - clone2() is already in use in IA64. Rename syscall to clone3()
...
> Index: linux-2.6/Documentation/clone2
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/clone2 2009-10-12 19:54:38.000000000 -0700

clone3?

> +struct clone_struct {
> + u64 flags;
> + u64 child_stack;

u64 seems wrong on 32 bit platforms. ulong?

> + u32 nr_pids;

So nr_pids is either 1 or 2?

> + u32 reserved1;
> + u64 parent_tid;
> + u64 child_tid;
> + u64 reserved2;
> +};
> +


> + See CLONE_NEWPID section of clone(2) man page for details about pid
> + namespaces.
> +
> + The order pids in @pids corresponds to the nesting order of pid-
> + namespaces, with @pids[0] corresponding to the init_pid_ns.

Ok, so I'm confused.

> + If a pid in the @pids list is 0, the kernel will assign the next
> + available pid in the pid namespace, for the process.
> +
> + If a pid in the @pids list is non-zero, the kernel tries to assign
> + the specified pid in that namespace. If that pid is already in use
> + by another process, the system call fails with -EBUSY.
...
> + On failure, clone3() returns -1 and sets 'errno' to one of following
> + values (the child process is not created).

Inconsistent with above. Syscalls really return -ERRCODE, errno is
glibc magic.

> +Example:
> +
> + pid_t pids[] = { 77, 99 };
> + struct clone_struct cs;
> +
> + cs.flags = (u64) SIGCHLD;
> + cs.child_stack = (u64) setup_child_stack();
> + cs.nr_pids = 2;
> + cs.parent_tid = 0LL;
> + cs.child_tid = 0LL;
> +
> + rc = syscall(__NR_clone3, &cs, pids);

Hmm, is there reason why pids are not at the end of struct
clone_struct? Passing most parameters in special structure, then pids
separately is strange...

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-10-19 21:47:45

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call



Daniel Lezcano wrote:
> Sukadev Bhattiprolu wrote:
>> Daniel Lezcano [[email protected]] wrote:
>>
>>> Sukadev Bhattiprolu wrote:
>>>
>>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>>>>
>>>> To support application checkpoint/restart, a task must have the same pid it
>>>> had when it was checkpointed. When containers are nested, the tasks within
>>>> the containers exist in multiple pid namespaces and hence have multiple pids
>>>> to specify during restart.
>>>>
>>>> This patchset implements a new system call, clone3() that lets a process
>>>> specify the pids of the child process.
>>>>

[...]

>>> Otherwise, shouldn't be more convenient to have something usable for
>>> everyone, let's say:
>>>
>>> cloneat(pid_t pid, pid_t desiredpid, ...);
>>>
>>> Where 'desiredpid' is a hint of for the kernel for the pid to be
>>> allocated (zero means the kernel will choose one for us) and the newly
>>> allocated task is the son of 'pid'.
>>>
>> Hmm, so P1 would call cloneat() to create a child P3 _on behalf_ of process
>> P2 ? I did not know we had a requirement for that. Can you explain the
>> use-case more ? IOW, why can't P2 create the child P3 by itself ?
>>
> I forgot to mention a constraint with the specified pid : P2 has to be
> child of P1.
> In other word, you can not specify a pid to clonat which is not your
> descendant (including yourself).
> With this constraint I think there is no security issues.

Sounds dangerous. What if your descendant executed a setuid program ?

>
> Concerning of forking on behalf of another process, we can consider it
> is up to the caller / programmer to know what it does. If a process in

Before the user can program with this syscall, _you_ need to define
the semantics of this syscall. For example, do you freeze that other
process to ensure that the clone is consistent ?

> the process hierarchy exec'ed a program and we cloneat this process and
> then the program fails because of an "unexpected error", well, we should
> have not done that. A similar example is when the IPC are removed while
> they are used by some other processes.
>
> Here it is a interesting use case:
> * if you created a pid namespace, and, let's say, booted a system
> container where the container init is the "init" process, then with this
> call you can enter the container at any time by doing cloneat() followed
> by an exec of your command. I think that was a requirement when there
> were discussions around "sys_hijack".

Can you define more precisely what you mean by "enter" the container ?

If you simply want create a new process in the container, you can
achieve the same thing with a daemon, or a smart init process (in
there), or even ptrace tricks.

Also, there is a reason why sys_hijack() was hijacked away ... And
I honestly think that a syscall to force another process to clone
would be shot down by the kernel guys.

> Another point. It's another way to extend the exhausted clone flags as
> the cloneat can be called as a compatibility way, with cloneat(getpid(),
> 0, ... )

Which is what the proposed new clone_....() does.

>
>> Note also that 'desiredpid' must be a list of pids (one for each pid
>> namespaces that the child will belong to) and hence we need 'nr_pids'
>> to specify the list. Given that we are limited to 6 parameters to the
>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>
>> So we should do something like:
>>
>> sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>> pid_t *desired_pids)
>>
>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>
> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there
> was a debate between kernel or userspace proctree creation but it looks
> like it's done from the kernel with this call.

It isn't multiple clones in one clone. The syscall creates *one* single
process. We just ask the kernel to assign a specific pid to that process.

And because processes that live in nested pid-namespace own multiple
pids (one for each level), we need to specify multiple pids (one for
each level) for this single process that we create.

Then, to create the entire restart process tree, we have to call this
system call as many times as the number of processes to restart.

And yes, this is all done in userspace. The _only_ kernel-space help
is an interface to request specific pid(s) for each restarted process.

> I don't really see a difference between sys_restart(pid_t pid , int fd,
> long flags) where pid_t is the topmost in the hierarchy, fd is a file
> descriptor to a structure "pid_t * + struct clone_args *" and flags is
> "PROCTREE".

What you describe here is the way openvz's kernel-side process tree
restart works. This is not the current approach.

> IMHO, it is nicer to recursively restore the process tree for the nested
> pid namespaces, that will be really an userspace process tree creation
> and cloneat will be your friend here :)

The restart.c utility from the userspace tools (user-cr) is your
friend here... I'm sure you'll think it's nice :p

>>> That looks more consistent with the "<syscall>at" family, 'openat',
>>> 'faccessat', 'readlinkat', etc ... and usable for something else than
>>> the checkpoint / restart.
>>>
>> The subtle difference though is that openat() does not open a file on
>> behalf of another process and so the 'at' suffix would not apply ?
>>
> Yes and no, depending of where you put the cursor. If you consider the
> 'at' suffix means a process context, then I agree with you, there is a
> difference because the cloneat will be out of the current process
> context. But if you consider the 'at' suffix as a context in general,
> and openat means "relatively to a file descriptor" and cloneat means
> "relatively to a pid namespace" the 'at' suffix may apply. But I agree
> that we are so used to call the posix "fork", that cloneat sounds scary :)

At this point, I really couldn't care less about how we name the
new syscall.

Check out the containers IRC channel for the latest pearls:
clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
the prominent contenders, together with xerox() and ditto()...

There you go; these should be enough to keep the discussion
around on life support for at least another week.

I really really really hope we can settle down on *a* name,
*any* name, and move forward. Amen.

Oren.

2009-10-19 23:50:14

by Matt Helsley

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>> |
>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>> | not consistent with any other convention used fo syscall naming,
>
> This assumption, of course, is just plain wrong. Look at the wait
> system calls, for example. However, when a small integer is used like
> that, it pretty much always reflects numbers of arguments.
>
>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>> | might make more sense.
>>>
>>> Sure, we talked about calling it clone_extended() and I can go back
>>> to that.
>>>
>>> Only minor concern with that name was if this new call ever needs to
>>> be extended, what would we call it :-). With clone3() we could add a
>>> real/fake parameter and call it clone4() :-p
>>
>> Perhaps clone64 (somewhat like stat64 for example)?
>>
>
> I think that doesn't exactly reflect the nature of the changes.

Yes. Without adopting an impractical encoding scheme it's quite
unlikely a small number like 3 or 64 could exactly reflect all the
changes :). I don't think that's a realistic objection though so...

> clone3() is actually pretty good.

I agree.

Cheers,
-Matt Helsley

2009-10-20 00:51:34

by Matt Helsley

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
>
>
> Daniel Lezcano wrote:
> > Sukadev Bhattiprolu wrote:
> >> Daniel Lezcano [[email protected]] wrote:
> >>
> >>> Sukadev Bhattiprolu wrote:
> >>>
> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
> >>>>

<snip>

> > Another point. It's another way to extend the exhausted clone flags as
> > the cloneat can be called as a compatibility way, with cloneat(getpid(),
> > 0, ... )
>
> Which is what the proposed new clone_....() does.

Just to be clear -- Suka's proposing to extend the clone flags. However I
don't believe reusing the "pid" parameters as Daniel seemed to suggest
was ever part of Suka's proposed changes.

<snip>

> > I don't really see a difference between sys_restart(pid_t pid , int fd,
> > long flags) where pid_t is the topmost in the hierarchy, fd is a file
> > descriptor to a structure "pid_t * + struct clone_args *" and flags is
> > "PROCTREE".

I think the difference has to do with keeping the code maintainable.

Clone creates the process so it's already involved in allocating and
assigning pids to the new task. Switching pids at sys_restart() would
add another point in the code where pids are allocated and assigned.
This suggests we may have to worry about introducing new obscure races
for anyone who's working on the pid allocator to be careful of. At
least when all the code is "localized" to the clone paths we can be
reasonably certain of proper maintenance.

<snip>

> I really really really hope we can settle down on *a* name,
> *any* name, and move forward. Amen.

clone3() seemed to be the leading contender from what I've read so far.
Does anyone still object to clone3() after reading the whole thread?

Cheers,
-Matt Helsley

2009-10-20 03:33:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Matt Helsley <[email protected]> writes:

> On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
>>
>>
>> Daniel Lezcano wrote:
>> > Sukadev Bhattiprolu wrote:
>> >> Daniel Lezcano [[email protected]] wrote:
>> >>
>> >>> Sukadev Bhattiprolu wrote:
>> >>>
>> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
>> >>>>
>
> <snip>
>
>> > Another point. It's another way to extend the exhausted clone flags as
>> > the cloneat can be called as a compatibility way, with cloneat(getpid(),
>> > 0, ... )
>>
>> Which is what the proposed new clone_....() does.
>
> Just to be clear -- Suka's proposing to extend the clone flags. However I
> don't believe reusing the "pid" parameters as Daniel seemed to suggest
> was ever part of Suka's proposed changes.
>
> <snip>
>
>> > I don't really see a difference between sys_restart(pid_t pid , int fd,
>> > long flags) where pid_t is the topmost in the hierarchy, fd is a file
>> > descriptor to a structure "pid_t * + struct clone_args *" and flags is
>> > "PROCTREE".
>
> I think the difference has to do with keeping the code maintainable.
>
> Clone creates the process so it's already involved in allocating and
> assigning pids to the new task. Switching pids at sys_restart() would
> add another point in the code where pids are allocated and assigned.
> This suggests we may have to worry about introducing new obscure races
> for anyone who's working on the pid allocator to be careful of. At
> least when all the code is "localized" to the clone paths we can be
> reasonably certain of proper maintenance.
>
> <snip>
>
>> I really really really hope we can settle down on *a* name,
>> *any* name, and move forward. Amen.
>
> clone3() seemed to be the leading contender from what I've read so far.
> Does anyone still object to clone3() after reading the whole thread?

I object to what clone3() is. The name is not particularly interesting.

The sanity checks for assigning pids are missing and there is a todo
about it. I am not comfortable with assigning pids to a new process
in a pid namespace with other processes user space processes executing
in it.

How we handle a clone extension depends critically on if we want to
create a processes for restart in user space or kernel space.

Could some one give me or point me at a strong case for creating the
processes for restart in user space?

The pid assignment code is currently ugly. I asked that we just pass
in the min max pid pids that already exist into the core pid
assignment function and a constrained min/max that only admits a
single pid when we are allocating a struct pid for restart. That was
not done and now we have a weird abortion with unnecessary special cases.

Eric

2009-10-20 04:01:48

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| > clone3() seemed to be the leading contender from what I've read so far.
| > Does anyone still object to clone3() after reading the whole thread?
|
| I object to what clone3() is. The name is not particularly interesting.
|
| The sanity checks for assigning pids are missing and there is a todo
| about it. I am not comfortable with assigning pids to a new process
| in a pid namespace with other processes user space processes executing
| in it.

Could you clarify ? How is the call to alloc_pidmap() from clone3() different
from the call from clone() itself ?

|
| How we handle a clone extension depends critically on if we want to
| create a processes for restart in user space or kernel space.
|
| Could some one give me or point me at a strong case for creating the
| processes for restart in user space?

There has been a lot of discussion on this with reference to the
Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
for instance.

|
| The pid assignment code is currently ugly. I asked that we just pass
| in the min max pid pids that already exist into the core pid
| assignment function and a constrained min/max that only admits a
| single pid when we are allocating a struct pid for restart. That was
| not done and now we have a weird abortion with unnecessary special cases.

I did post a version of the patch attemptint to implement that. As
pointed out in:

http://lkml.org/lkml/2009/8/17/445

we would need more checks in alloc_pidmap() to cover cases like min or max
being invalid or min being greater than max or max being greater than pid_max
etc. Those checks also made the code ugly (imo).

Sukadev

2009-10-20 10:46:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu <[email protected]> writes:

> Eric W. Biederman [[email protected]] wrote:
> | > clone3() seemed to be the leading contender from what I've read so far.
> | > Does anyone still object to clone3() after reading the whole thread?
> |
> | I object to what clone3() is. The name is not particularly interesting.
> |
> | The sanity checks for assigning pids are missing and there is a todo
> | about it. I am not comfortable with assigning pids to a new process
> | in a pid namespace with other processes user space processes executing
> | in it.
>
> Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> from the call from clone() itself ?

I think it is totally inappropriate to assign pids in a pid namespace
where there are user space processes already running.

> | How we handle a clone extension depends critically on if we want to
> | create a processes for restart in user space or kernel space.
> |
> | Could some one give me or point me at a strong case for creating the
> | processes for restart in user space?
>
> There has been a lot of discussion on this with reference to the
> Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> for instance.

Just read it. Thank you. Now I am certain clone_with_pids() is
not useful functionality to be exporting to userspace.

The only real argument in favor of doing this in user space is greater
flexibility. I can see checkpointing/restoring a single thread process
without a pid namespace. Anything more and you are just asking for
trouble.

A design that weakens security. Increases maintenance costs. All for
an unreliable result seems like a bad one to me.

> | The pid assignment code is currently ugly. I asked that we just pass
> | in the min max pid pids that already exist into the core pid
> | assignment function and a constrained min/max that only admits a
> | single pid when we are allocating a struct pid for restart. That was
> | not done and now we have a weird abortion with unnecessary special cases.
>
> I did post a version of the patch attemptint to implement that. As
> pointed out in:
>
> http://lkml.org/lkml/2009/8/17/445
>
> we would need more checks in alloc_pidmap() to cover cases like min or max
> being invalid or min being greater than max or max being greater than pid_max
> etc. Those checks also made the code ugly (imo).

If you need more checks you are doing it wrong. The code already has min
and max values, and even a start value. I was just strongly suggesting
we generalize where we get the values from, and then we have not special
cases.

Eric

2009-10-21 00:39:08

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Quoting Eric W. Biederman ([email protected]):
> Matt Helsley <[email protected]> writes:
>
> > On Mon, Oct 19, 2009 at 05:47:43PM -0400, Oren Laadan wrote:
> >>
> >>
> >> Daniel Lezcano wrote:
> >> > Sukadev Bhattiprolu wrote:
> >> >> Daniel Lezcano [[email protected]] wrote:
> >> >>
> >> >>> Sukadev Bhattiprolu wrote:
> >> >>>
> >> >>>> Subject: [RFC][v8][PATCH 0/10] Implement clone3() system call
> >> >>>>
> >
> > <snip>
> >
> >> > Another point. It's another way to extend the exhausted clone flags as
> >> > the cloneat can be called as a compatibility way, with cloneat(getpid(),
> >> > 0, ... )
> >>
> >> Which is what the proposed new clone_....() does.
> >
> > Just to be clear -- Suka's proposing to extend the clone flags. However I
> > don't believe reusing the "pid" parameters as Daniel seemed to suggest
> > was ever part of Suka's proposed changes.
> >
> > <snip>
> >
> >> > I don't really see a difference between sys_restart(pid_t pid , int fd,
> >> > long flags) where pid_t is the topmost in the hierarchy, fd is a file
> >> > descriptor to a structure "pid_t * + struct clone_args *" and flags is
> >> > "PROCTREE".
> >
> > I think the difference has to do with keeping the code maintainable.
> >
> > Clone creates the process so it's already involved in allocating and
> > assigning pids to the new task. Switching pids at sys_restart() would
> > add another point in the code where pids are allocated and assigned.
> > This suggests we may have to worry about introducing new obscure races
> > for anyone who's working on the pid allocator to be careful of. At
> > least when all the code is "localized" to the clone paths we can be
> > reasonably certain of proper maintenance.
> >
> > <snip>
> >
> >> I really really really hope we can settle down on *a* name,
> >> *any* name, and move forward. Amen.
> >
> > clone3() seemed to be the leading contender from what I've read so far.
> > Does anyone still object to clone3() after reading the whole thread?
>
> I object to what clone3() is. The name is not particularly interesting.
>
> The sanity checks for assigning pids are missing and there is a todo
> about it. I am not comfortable with assigning pids to a new process
> in a pid namespace with other processes user space processes executing
> in it.
>
> How we handle a clone extension depends critically on if we want to
> create a processes for restart in user space or kernel space.
>
> Could some one give me or point me at a strong case for creating the
> processes for restart in user space?
>
> The pid assignment code is currently ugly. I asked that we just pass
> in the min max pid pids that already exist into the core pid
> assignment function and a constrained min/max that only admits a
> single pid when we are allocating a struct pid for restart. That was
> not done and now we have a weird abortion with unnecessary special cases.

I asked you (I believe twice) to clarify how on earth you meant for
that to be done for hierarchical pid namespaces (a task being restored
which needs two of it's 4 pids specified), and you did not reply.

Did you mean for it to be done through procfiles? If so, does the
task have to keep multiple /proc mounts around, one for each
pidns hierarchy in which it needs to specify a pid? Or did you have
another idea in mind? A single procfile into which multiple pids can
be specified in a list? A completely different interface?

Or do you mean for this to be done only from the kernel?

thanks,
-serge

2009-10-21 00:39:21

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Quoting Eric W. Biederman ([email protected]):
> Sukadev Bhattiprolu <[email protected]> writes:
> The only real argument in favor of doing this in user space is greater
> flexibility.

Yyyyup.

> I can see checkpointing/restoring a single thread process
> without a pid namespace. Anything more and you are just asking for
> trouble.
>
> A design that weakens security.

How does it weaken security? It requires CAP_SYS_ADMIN, and, when
targeted capabilities are added, it will require CAP_SYS_ADMIN to
only those pid namespaces in which we choose a pid.

> Increases maintenance costs.

Does it really? It re-uses most of the existing code. More so than
the only existing in-kernel restart code I've seen does.

> All for
> an unreliable result seems like a bad one to me.

I've personally always been on the fence as to whether we rebuild the
process tree in user-space or kernel. And I'm still on the fence, as
nothing you've said has convinced me otherwise.

> > | The pid assignment code is currently ugly. I asked that we just pass
> > | in the min max pid pids that already exist into the core pid
> > | assignment function and a constrained min/max that only admits a
> > | single pid when we are allocating a struct pid for restart. That was
> > | not done and now we have a weird abortion with unnecessary special cases.
> >
> > I did post a version of the patch attemptint to implement that. As
> > pointed out in:
> >
> > http://lkml.org/lkml/2009/8/17/445
> >
> > we would need more checks in alloc_pidmap() to cover cases like min or max
> > being invalid or min being greater than max or max being greater than pid_max
> > etc. Those checks also made the code ugly (imo).
>
> If you need more checks you are doing it wrong. The code already has min
> and max values, and even a start value. I was just strongly suggesting
> we generalize where we get the values from, and then we have not special
> cases.

(I'm not sure whether this argument is a separate one - regarding the
implementation of choosing the pid - from the kernel-vs-userspace one
or not, so will wait for a response to my other email about your API)

thanks,
-serge

2009-10-20 18:32:25

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
| > from the call from clone() itself ?
|
| I think it is totally inappropriate to assign pids in a pid namespace
| where there are user space processes already running.

Honestly, I don't understand why it is inappropriate or how this differs
from normal clone() - which also assigns pids in own and ancestor pid
namespaces.

|
| > | How we handle a clone extension depends critically on if we want to
| > | create a processes for restart in user space or kernel space.
| > |
| > | Could some one give me or point me at a strong case for creating the
| > | processes for restart in user space?
| >
| > There has been a lot of discussion on this with reference to the
| > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
| > for instance.
|
| Just read it. Thank you.

Sorry. I should have mentioned the reason here. (Like you mention below),
flexibility is the main reason.

| Now I am certain clone_with_pids() is not useful functionality to be
| exporting to userspace.
|
| The only real argument in favor of doing this in user space is greater
| flexibility. I can see checkpointing/restoring a single thread process
| without a pid namespace. Anything more and you are just asking for
| trouble.
|
| A design that weakens security. Increases maintenance costs. All for
| an unreliable result seems like a bad one to me.
|
| > | The pid assignment code is currently ugly. I asked that we just pass
| > | in the min max pid pids that already exist into the core pid
| > | assignment function and a constrained min/max that only admits a
| > | single pid when we are allocating a struct pid for restart. That was
| > | not done and now we have a weird abortion with unnecessary special cases.
| >
| > I did post a version of the patch attemptint to implement that. As
| > pointed out in:
| >
| > http://lkml.org/lkml/2009/8/17/445
| >
| > we would need more checks in alloc_pidmap() to cover cases like min or max
| > being invalid or min being greater than max or max being greater than pid_max
| > etc. Those checks also made the code ugly (imo).
|
| If you need more checks you are doing it wrong. The code already has min
| and max values, and even a start value. I was just strongly suggesting
| we generalize where we get the values from, and then we have not special
| cases.

Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
parameters passed in (ie assumes that callers pass it in correctly)
it might be simple. But when user specifies the pid, the

min == max == user's target pid

so we will need to check the values either here or in callers.

Yes the code already has values and a start value. But these are
controlled by alloc_pidmap() and not passed in from the user space.

alloc_pidmap() needs to assign the next available pid or a specific
target pid. Generalizing it to alloc a pid in a range seemed be a
bit of an over kill for currently known usages.

I will post a version of the patch outside this patchset with min
and max parameters and we can see if it can be optimized/beautified.

Sukadev

2009-10-20 19:27:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu <[email protected]> writes:

> Eric W. Biederman [[email protected]] wrote:
> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
> | > from the call from clone() itself ?
> |
> | I think it is totally inappropriate to assign pids in a pid namespace
> | where there are user space processes already running.
>
> Honestly, I don't understand why it is inappropriate or how this differs
> from normal clone() - which also assigns pids in own and ancestor pid
> namespaces.

The fact we can specify which pids we want. I won't claim it is as
exploitable as NULL pointer deferences have been but it has that kind
of feel to it.

> | > | How we handle a clone extension depends critically on if we want to
> | > | create a processes for restart in user space or kernel space.
> | > |
> | > | Could some one give me or point me at a strong case for creating the
> | > | processes for restart in user space?
> | >
> | > There has been a lot of discussion on this with reference to the
> | > Checkpoint/Restart patchset. See http://lkml.org/lkml/2009/4/13/401
> | > for instance.
> |
> | Just read it. Thank you.
>
> Sorry. I should have mentioned the reason here. (Like you mention below),
> flexibility is the main reason.
>
> | Now I am certain clone_with_pids() is not useful functionality to be
> | exporting to userspace.
> |
> | The only real argument in favor of doing this in user space is greater
> | flexibility. I can see checkpointing/restoring a single thread process
> | without a pid namespace. Anything more and you are just asking for
> | trouble.
> |
> | A design that weakens security. Increases maintenance costs. All for
> | an unreliable result seems like a bad one to me.
> |
> | > | The pid assignment code is currently ugly. I asked that we just pass
> | > | in the min max pid pids that already exist into the core pid
> | > | assignment function and a constrained min/max that only admits a
> | > | single pid when we are allocating a struct pid for restart. That was
> | > | not done and now we have a weird abortion with unnecessary special cases.
> | >
> | > I did post a version of the patch attemptint to implement that. As
> | > pointed out in:
> | >
> | > http://lkml.org/lkml/2009/8/17/445
> | >
> | > we would need more checks in alloc_pidmap() to cover cases like min or max
> | > being invalid or min being greater than max or max being greater than pid_max
> | > etc. Those checks also made the code ugly (imo).
> |
> | If you need more checks you are doing it wrong. The code already has min
> | and max values, and even a start value. I was just strongly suggesting
> | we generalize where we get the values from, and then we have not special
> | cases.
>
> Well, if alloc_pidmap(pid_ns, min, max) does not have to check the
> parameters passed in (ie assumes that callers pass it in correctly)
> it might be simple. But when user specifies the pid, the
>
> min == max == user's target pid
>
> so we will need to check the values either here or in callers.

Agreed. When you are talking about the target pid. That code path
needs the extra check.

> Yes the code already has values and a start value. But these are
> controlled by alloc_pidmap() and not passed in from the user space.

I was only thinking passed in from someplace else in kernel/pid.c

> alloc_pidmap() needs to assign the next available pid or a specific
> target pid. Generalizing it to alloc a pid in a range seemed be a
> bit of an over kill for currently known usages.

alloc_pidmap in assigning the next available pid is allocating a pid
in a range.

> I will post a version of the patch outside this patchset with min
> and max parameters and we can see if it can be optimized/beautified.

Thanks,
Eric

2009-10-20 20:13:12

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call



Eric W. Biederman wrote:
> Sukadev Bhattiprolu <[email protected]> writes:
>
>> Eric W. Biederman [[email protected]] wrote:
>> | > Could you clarify ? How is the call to alloc_pidmap() from clone3() different
>> | > from the call from clone() itself ?
>> |
>> | I think it is totally inappropriate to assign pids in a pid namespace
>> | where there are user space processes already running.
>>
>> Honestly, I don't understand why it is inappropriate or how this differs
>> from normal clone() - which also assigns pids in own and ancestor pid
>> namespaces.
>
> The fact we can specify which pids we want. I won't claim it is as
> exploitable as NULL pointer deferences have been but it has that kind
> of feel to it.

This security concern was first brought up by Linus, and to address it
we made clone3() require that the user be privileged to select pids.

But, honestly, a clone3() that allows the caller to request a specific
pid is like a restart() syscall that allows the caller to restore a
process with its original pid: you would simply checkpoint, and then
alter the pid in the checkpoint image and restart, repeat ad infinitum.

>From your security prism, they are equivalent: they practically allow
a user to have an arbitrary process with a selected pid. So it doesn't
really matter - for this security concern - if you select the pid in
the kernel through restart() or from userspace through clone3().

Sure, you can also "choose" a pid today, by repeatedly forking until
you get what you want... but that is harder to exploit. In contrast,
both clone3() and restart() allow pid selection, instantaneously.
Which is why both require privileges if the caller wants to select/
restore pids.

Oren.

2009-10-21 04:26:15

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On Tue, Oct 20, 2009 at 1:50 AM, Matt Helsley <[email protected]> wrote:
> On Tue, Oct 20, 2009 at 06:31:20AM +0900, H. Peter Anvin wrote:
>> On 10/20/2009 02:44 AM, Matt Helsley wrote:
>>>> |
>>>> | I know I'm late to this discussion, but why the name clone3()? It's
>>>> | not consistent with any other convention used fo syscall naming,
>>
>> This assumption, of course, is just plain wrong. ?Look at the wait
>> system calls, for example. ?However, when a small integer is used like
>> that, it pretty much always reflects numbers of arguments.
>>
>>>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>>>> | might make more sense.
>>>>
>>>> Sure, we talked about calling it clone_extended() and I can go back
>>>> to that.
>>>>
>>>> Only minor concern with that name was if this new call ever needs to
>>>> be extended, what would we call it :-). With clone3() we could add a
>>>> real/fake parameter and call it clone4() :-p
>>>
>>> Perhaps clone64 (somewhat like stat64 for example)?
>>>
>>
>> I think that doesn't exactly reflect the nature of the changes.
>
> Yes. Without adopting an impractical encoding scheme it's quite
> unlikely a small number like 3 or 64 could exactly reflect all the
> changes :). I don't think that's a realistic objection though so...
>
>> clone3() is actually pretty good.
>
> I agree.

My question here is: what does "3" actually mean? In general, system
calls have not followed any convention of numbering to indicate
successive versions -- clone2() being the one possible exception that
I know of.

The only other conventions used for numbering new versions of system
calls relates either to arguments size (e.g., 32 versus 64) or to
their number number of arguments (dup2(), dup3(), wait3(), wait4(),
accept4(), eventfd2(), inotify_init1(), epoll_create1(), evetfd2(),
signalfd4()). The former convention makes some sense, but the latter
is rather questionable, for a couple of reasons. One is that the
number of arguments for the system call may change in the future
(several of the newer system calls have a flags argument which could
be used to indeicate the presence of an additional, optional
argument). Another reason that the latter convention is questionable
is that the number of arguments exposed to userspace by glibc may be
different from the number of arguments in the raw syscall. For
example, signafd4() has 4 arguments, but the glibc interface
(signalfd() http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd.2.html)
has 3.

Using the name clone3() follows no historical convention, which is why
it seems unwise to me. Thus my suggestion of clonex() (like e.g.,
adjtimex()), though quite possibly there could be some better name.

Sukadev, you wrote "With clone3() we could add areal/fake parameter
and call it clone4()". This rasies for me the question: should
clone3() have a flags argument, so as to allow these types of
extensions (i.e., not for clone flags, but rather to indicate changes
in the system call interface). Yes, I understand there is a 64-bit
flags in 'struct clone_struct', but I wonder whether there is any
virtue in having an additional flags argument in the base signature of
the function, as per the following:
http://www.kernel.org/doc/man-pages/online/pages/man2/dup3.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/signalfd4.2.html
http://www.kernel.org/doc/man-pages/online/pages/man2/eventfd2.2.html
see also
http://linux-man-pages.blogspot.com/2008/10/recent-changes-in-file-descriptor.html

By the way, one further thought: why "struct clone_struct"? We know
it's a struct. Therefore, it seems pointless to include that in the
name. Something like "struct clone_args" would seem less redundant and
slightly more meaningful as a name.

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2009-10-21 06:18:53

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| > I will post a version of the patch outside this patchset with min
| > and max parameters and we can see if it can be optimized/beautified.
|
| Thanks,
| Eric

Here is the patch. alloc_pid() calls alloc_pidmap() as follows:

- nr = alloc_pidmap(tmp);
+ min = max = 0;
+ if (target_pids) {
+ min = target_pids[i];
+ max = min + 1;
+ }
+ nr = alloc_pidmap(tmp, min, max);

It does look like this patch executes a bit more code even in the common
case but let me know if you think this is better.

---

From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Tue, 20 Oct 2009 17:01:18 -0700
Subject: [PATCH] Add min and max parameters to alloc_pidmap()

With support for setting a specific pid number for a process, alloc_pidmap()
will need a 'min' and a 'max' parameter.

---
kernel/pid.c | 47 ++++++++++++++++++++++++++++++++++-------------
1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..56e13c1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,30 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int min, int max)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
int rc = -EAGAIN;
struct pidmap *map;

- pid = last + 1;
- if (pid >= pid_max)
- pid = RESERVED_PIDS;
+ if (min < 0 || max < 0 || max > pid_max)
+ return -EINVAL;
+
+ if (!max)
+ max = pid_max;
+
+ pid = min;
+ if (pid && pid >= max)
+ return -EINVAL;
+ else if (!pid) {
+ pid = last + 1;
+ if (pid >= pid_max)
+ pid = RESERVED_PIDS;
+ }
+
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
- max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
rc = alloc_pidmap_page(map);
if (rc)
@@ -168,7 +180,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
- pid_ns->last_pid = pid;
+ /*
+ * 'last_pid' only makes sense when
+ * choosing from entire range
+ */
+ if (!min)
+ pid_ns->last_pid = pid;
return pid;
}
offset = find_next_offset(map, offset);
@@ -179,22 +196,26 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
* bitmap block and the final block was the same
* as the starting point, pid is before last_pid.
*/
- } while (offset < BITS_PER_PAGE && pid < pid_max &&
+ } while (offset < BITS_PER_PAGE && pid < max &&
(i != max_scan || pid < last ||
!((last+1) & BITS_PER_PAGE_MASK)));
}
- if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+ if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
++map;
offset = 0;
} else {
map = &pid_ns->pidmap[0];
offset = RESERVED_PIDS;
- if (unlikely(last == offset)) {
- rc = -EAGAIN;
- break;
- }
}
pid = mk_pid(pid_ns, map, offset);
+ /*
+ * If we are back to where we started, well, no pids are
+ * currently available in selected range.
+ */
+ if (pid < min || unlikely(last == offset)) {
+ rc = -EAGAIN;
+ break;
+ }
}
return rc;
}
@@ -272,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ nr = alloc_pidmap(tmp, 0, 0);
if (nr < 0)
goto out_free;

--
1.6.0.4

2009-10-21 08:37:42

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall

>
> > +struct clone_struct {
> > + u64 flags;
> > + u64 child_stack;
>
> u64 seems wrong on 32 bit platforms. ulong?

That would make it incompatible between 64 bit kernels and
32 bit user space, requiring a wrapper. Better leave it at u64.

> > + If a pid in the @pids list is 0, the kernel will assign the next
> > + available pid in the pid namespace, for the process.
> > +
> > + If a pid in the @pids list is non-zero, the kernel tries to assign
> > + the specified pid in that namespace. If that pid is already in use
> > + by another process, the system call fails with -EBUSY.
> ...
> > + On failure, clone3() returns -1 and sets 'errno' to one of following
> > + values (the child process is not created).
>
> Inconsistent with above. Syscalls really return -ERRCODE, errno is
> glibc magic.

Quite the opposite is true.

The man page describes what the user space sees, which is errno. Returning
-ERRCODE to libc from the kernel is part of the architecture specific
kernel ABI and should not be documented in this place but in the architecture
documentation.

> > + pid_t pids[] = { 77, 99 };
> > + struct clone_struct cs;
> > +
> > + cs.flags = (u64) SIGCHLD;
> > + cs.child_stack = (u64) setup_child_stack();
> > + cs.nr_pids = 2;
> > + cs.parent_tid = 0LL;
> > + cs.child_tid = 0LL;
> > +
> > + rc = syscall(__NR_clone3, &cs, pids);
>
> Hmm, is there reason why pids are not at the end of struct
> clone_struct? Passing most parameters in special structure, then pids
> separately is strange...

I suggested doing that, it's a lot easier to handle fixed length data
structures than an array at the end.

Arnd <><

2009-10-21 09:16:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu <[email protected]> writes:

> Eric W. Biederman [[email protected]] wrote:
> | > I will post a version of the patch outside this patchset with min
> | > and max parameters and we can see if it can be optimized/beautified.
> |
> | Thanks,
> | Eric
>
> Here is the patch. alloc_pid() calls alloc_pidmap() as follows:
>
> - nr = alloc_pidmap(tmp);
> + min = max = 0;
> + if (target_pids) {
> + min = target_pids[i];
> + max = min + 1;
> + }
> + nr = alloc_pidmap(tmp, min, max);
>
> It does look like this patch executes a bit more code even in the common
> case but let me know if you think this is better.
>
> ---

Not what I was thinking. The following untested patch is what I was
thinking. It just exposes last, min, and max to the callers which pass
in different appropriate values.

Eric

diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..d9b6f82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,17 +122,17 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min, int max)
{
- int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int i, offset, max_scan, pid;
struct pidmap *map;

pid = last + 1;
- if (pid >= pid_max)
- pid = RESERVED_PIDS;
+ if (pid >= max)
+ pid = min;
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
- max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
if (unlikely(!map->page)) {
void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
@@ -153,7 +153,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
- pid_ns->last_pid = pid;
return pid;
}
offset = find_next_offset(map, offset);
@@ -164,16 +163,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
* bitmap block and the final block was the same
* as the starting point, pid is before last_pid.
*/
- } while (offset < BITS_PER_PAGE && pid < pid_max &&
+ } while (offset < BITS_PER_PAGE && pid < max &&
(i != max_scan || pid < last ||
!((last+1) & BITS_PER_PAGE_MASK)));
}
- if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+ if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
++map;
offset = 0;
} else {
map = &pid_ns->pidmap[0];
- offset = RESERVED_PIDS;
+ offset = min;
if (unlikely(last == offset))
break;
}
@@ -182,6 +181,25 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
return -1;
}

+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+ int nr;
+
+ nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+ if (nr >= 0)
+ pid_ns->last_pid = nr;
+ return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+ if (target >= pid_max)
+ return -1;
+ if (target < RESERVED_PIDS)
+ return -1;
+ return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
int next_pidmap(struct pid_namespace *pid_ns, int last)
{
int offset;

2009-10-21 09:33:42

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall

On Wed 2009-10-21 10:37:38, Arnd Bergmann wrote:
> >
> > > +struct clone_struct {
> > > + u64 flags;
> > > + u64 child_stack;
> >
> > u64 seems wrong on 32 bit platforms. ulong?
>
> That would make it incompatible between 64 bit kernels and
> 32 bit user space, requiring a wrapper. Better leave it at u64.

Ok.

> > > + If a pid in the @pids list is 0, the kernel will assign the next
> > > + available pid in the pid namespace, for the process.
> > > +
> > > + If a pid in the @pids list is non-zero, the kernel tries to assign
> > > + the specified pid in that namespace. If that pid is already in use
> > > + by another process, the system call fails with -EBUSY.
> > ...
> > > + On failure, clone3() returns -1 and sets 'errno' to one of following
> > > + values (the child process is not created).
> >
> > Inconsistent with above. Syscalls really return -ERRCODE, errno is
> > glibc magic.
>
> Quite the opposite is true.

Well, it is still inconsistent. Half the docs talks -ERRCODE, half
talks -1/errno=ERRCODE.

> > > + pid_t pids[] = { 77, 99 };
> > > + struct clone_struct cs;
> > > +
> > > + cs.flags = (u64) SIGCHLD;
> > > + cs.child_stack = (u64) setup_child_stack();
> > > + cs.nr_pids = 2;
> > > + cs.parent_tid = 0LL;
> > > + cs.child_tid = 0LL;
> > > +
> > > + rc = syscall(__NR_clone3, &cs, pids);
> >
> > Hmm, is there reason why pids are not at the end of struct
> > clone_struct? Passing most parameters in special structure, then pids
> > separately is strange...
>
> I suggested doing that, it's a lot easier to handle fixed length data
> structures than an array at the end.

You could still put the something * pids at the end of (fixed length)
structure.

(Not that I agree with the argument, pid array is variable-length,
anyway, and inlining it at the end of structure should not make code
more complex).

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-10-21 13:12:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>
> My question here is: what does "3" actually mean? In general, system
> calls have not followed any convention of numbering to indicate
> successive versions -- clone2() being the one possible exception that
> I know of.
>

"3" is number of arguments. It's better than "extended" or something
like that simply because "extended" just means "more than", and a number
at least tells you *how much more than*.

-hpa

2009-10-21 13:26:56

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall

On Wednesday 21 October 2009, Pavel Machek wrote:
> > > Hmm, is there reason why pids are not at the end of struct
> > > clone_struct? Passing most parameters in special structure, then pids
> > > separately is strange...
> >
> > I suggested doing that, it's a lot easier to handle fixed length data
> > structures than an array at the end.
>
> You could still put the something * pids at the end of (fixed length)
> structure.
>
> (Not that I agree with the argument, pid array is variable-length,
> anyway, and inlining it at the end of structure should not make code
> more complex).

Please read up on the v1 to v7 patches, we've been through all this
before. There are obviously a lot of ways to do it, many of which are
equally good. The important thing to note is that the user-visible
interface should *not* pass the structure into the library that provides
this function but build the structure itself, so it doesn't matter that
much.

If you put the array at the end of the data structure, the wrapper
will have to copy the array provided by the user to a temporary
buffer. Passing the array as a separate syscall argument avoids
this.

Arnd <><

2009-10-21 15:53:25

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
[ ... ]

>> I forgot to mention a constraint with the specified pid : P2 has to be
>> child of P1.
>> In other word, you can not specify a pid to clonat which is not your
>> descendant (including yourself).
>> With this constraint I think there is no security issues.
>
> Sounds dangerous. What if your descendant executed a setuid program ?

That does not happen because you inherit the context of the caller.

>> Concerning of forking on behalf of another process, we can consider it
>> is up to the caller / programmer to know what it does. If a process in
>
> Before the user can program with this syscall, _you_ need to define
> the semantics of this syscall.
Yes, you are right. Here it is the proposition of the semantics.

Function prototype is:

pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);

Structure types are:

typedef int clone_flag_t;

struct clone_args {
clone_flag_t *flags;
int flags_size;
u32 reserved1;
u32 reserved2;
u64 child_stack_base;
u64 child_stack_size;
u64 parent_tid_ptr;
u64 child_tid_ptr;
u64 reserved3;
};

With the helper macros:

void CLONE_SET(int flag, clone_flag_t *flags);
void CLONE_CLR(int flag, clone_flag_t *flags);
bool CLONE_ISSET(int flag, clone_flag_t *flags);
void CLONE_ZERO(flag_t *clone_flags);

And:

#define CLONEXT_VM 0x20 /* CLONE_VM>>3 */
#define CLONEXT_FS 0x21
#define CLONEXT_FILES 0x22
...

The function clones the current task and reparent the child to the
process specified in the 'pid' parameter and copy the nsproxy from it
(if different).

If the 'hint' parameter is different from zero, then the 'hint' value
will be the pid of the child task, otherwise a value is chosen
automatically by the system like the usual clone. If the 'hint' is
specified and the task id is already in use, then the call fails.

The syscall returns the child task id on success, < 0 otherwise.

The specified 'pid' _must_ be a descendant of the caller. This is more
consistent with the inherited resources with clone in a process
hierarchy and less dangerous than allowing to cloneat everywhere. The
caller is at the topmost process hierarchy the cloneat is allowed.

It is not possible for the caller to wait for a process created with
cloneat where the resulting tasks is not its direct child.

The resources have to be shared across the process hierarchy in order to
use the right flags to clone. eg. it is not possible to create a thread
to a child process.

For example:

P 1
|
fork()__________ P 2
|
|
|
|
cloneat(P2, 0, { ... flags[0] & CLONE_VM ... } => -EINVAL;


The cloneat syscall can be used for the following use cases:

* checkpoint / restart:

The restart can be done with a clone(.., CLONE_NEWPID|...);
Then the new pid (aka pid 1) retrieves the proctree from the statefile
and creates the different tasks with the process hierarchy with the
cloneat syscall.

The proctree creation can be done from outside of the pid namespace or
from inside.

Concerning nested pid namespaces, IMHO I would not try to checkpoint /
restart them. The checkpoint of a nested pid namespace should be
forbidden except for the leaf of a pid namespaces tree. That should
allow to do partial process tree checkpoint if the application is aware
of that and creates a new pid namespace for each subtree it wants to be
checkpointed.

If this is too restrictive, the struct clone_args can be added with 2
other fields "unused" for future use.

* execute a command in a container:

If we have a container with the container init process which is usually
a child reaper (otherwise daemons are not supported or we have zombie
factory), we can easily cloneat(initpid, 0, ...) and exec a command. As
the processes of the container are always reparented to the container
init, it is safe to do that.

* clone syscall compatibility + extended clone flags

The cloneat function can be used like the usual clone function with:

cloneat(getpid(), 0, clone_args);

And the extended clone flags can be used.


[ ... ]

> Can you define more precisely what you mean by "enter" the container ?
>
> If you simply want create a new process in the container, you can
> achieve the same thing with a daemon, or a smart init process (in
> there), or even ptrace tricks.

Yes, you can launch a daemon inside the container, that works for a
system container because the container is killed by killing the first
process of the container or by a shutdown inside the container (not
fully implemented in the kernel).
But this is unreliable for application containers, I won't enter in the
details but the container exits when the application exits, with a
daemon inside the container, this is no longer the case because you can
not detect the application death as the daemon is always there.

With cloneat you restrict the life cycle of the command you launched,
that is the container exits as soon as all the processes exited the
container, including the spawned command itself.

> Also, there is a reason why sys_hijack() was hijacked away ... And
> I honestly think that a syscall to force another process to clone
> would be shot down by the kernel guys.
Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

>> Another point. It's another way to extend the exhausted clone flags as
>> the cloneat can be called as a compatibility way, with cloneat(getpid(),
>> 0, ... )
>
> Which is what the proposed new clone_....() does.
Yes, right. What I meant is we still have the clone extension feature
you have with clone_with_pids.

>>> Note also that 'desiredpid' must be a list of pids (one for each pid
>>> namespaces that the child will belong to) and hence we need 'nr_pids'
>>> to specify the list. Given that we are limited to 6 parameters to the
>>> syscall, such parameters must be stuffed into 'struct clone_args'.
>>>
>>> So we should do something like:
>>>
>>> sys_clone3(u32 flags_low, pid_t pid, struct clone_args *carg,
>>> pid_t *desired_pids)
>>>
>>> or (to match the name and parameters, move 'pid' parameter into clone_args)
>>>
>> Well, hiding multiple clone in one clone call is ... weird. AFAIR, there
>> was a debate between kernel or userspace proctree creation but it looks
>> like it's done from the kernel with this call.
>
> It isn't multiple clones in one clone. The syscall creates *one* single
> process. We just ask the kernel to assign a specific pid to that process.
>
> And because processes that live in nested pid-namespace own multiple
> pids (one for each level), we need to specify multiple pids (one for
> each level) for this single process that we create.
>
> Then, to create the entire restart process tree, we have to call this
> system call as many times as the number of processes to restart.
>
> And yes, this is all done in userspace. The _only_ kernel-space help
> is an interface to request specific pid(s) for each restarted process.
Aaah, Ok ! I understand better now. Thanks for clarifying this point.

>> I don't really see a difference between sys_restart(pid_t pid , int fd,
>> long flags) where pid_t is the topmost in the hierarchy, fd is a file
>> descriptor to a structure "pid_t * + struct clone_args *" and flags is
>> "PROCTREE".
Ok. Never mind :)


[ ... ]

>
> At this point, I really couldn't care less about how we name the
> new syscall.
>
> Check out the containers IRC channel for the latest pearls:
> clone_plus_with_aloe() and clone_plus_with_aloe_3_args() are
> the prominent contenders, together with xerox() and ditto()...
>
> There you go; these should be enough to keep the discussion
> around on life support for at least another week.
>
> I really really really hope we can settle down on *a* name,
> *any* name, and move forward. Amen.
Amen and Alea Jacta Est :)

Thanks
-- Daniel

2009-10-21 18:26:11

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 10/10]: Document clone3() syscall

Pavel Machek [[email protected]] wrote:
| Hi!
|
| > This gives a brief overview of the clone3() system call. We should
|
| Thanks!
|
| > eventually describe more details in existing clone(2) man page or in
| > a new man page.
|
| M. Kerrisk (see MAINTAINERS) maintains man pages...

Ok. I copied linux-api and M. Kerrisk is looking at this patchset.

|
| > Changelog[v8]:
| > - clone2() is already in use in IA64. Rename syscall to clone3()
| ...
| > Index: linux-2.6/Documentation/clone2
| > ===================================================================
| > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
| > +++ linux-2.6/Documentation/clone2 2009-10-12 19:54:38.000000000 -0700
|
| clone3?

Ah, yes, will fix.

|
| > +struct clone_struct {
| > + u64 flags;
| > + u64 child_stack;
|
| u64 seems wrong on 32 bit platforms. ulong?
|
| > + u32 nr_pids;
|
| So nr_pids is either 1 or 2?

No. With pid namespaces, which can be nested to arbitrary levels, each
process has several pid numbers - one in its own namespace and one in
each ancestor pid namespaces.

nr_pids specifies the number of pids the user cares about. IOW, if you
checkpoint a process that is 3 levels below the init-pid-ns, and you
care about the pids for those three levels, you would pass in an
array of 4 pid_ts and nr_pids would be 4.

|
| > + u32 reserved1;
| > + u64 parent_tid;
| > + u64 child_tid;
| > + u64 reserved2;
| > +};
| > +
|
|
| > + See CLONE_NEWPID section of clone(2) man page for details about pid
| > + namespaces.
| > +
| > + The order pids in @pids corresponds to the nesting order of pid-
| > + namespaces, with @pids[0] corresponding to the init_pid_ns.
|
| Ok, so I'm confused.

Hope the above note on pid-namespaces helps, if not let me know.

|
| > + If a pid in the @pids list is 0, the kernel will assign the next
| > + available pid in the pid namespace, for the process.
| > +
| > + If a pid in the @pids list is non-zero, the kernel tries to assign
| > + the specified pid in that namespace. If that pid is already in use
| > + by another process, the system call fails with -EBUSY.
| ...
| > + On failure, clone3() returns -1 and sets 'errno' to one of following
| > + values (the child process is not created).
|
| Inconsistent with above. Syscalls really return -ERRCODE, errno is
| glibc magic.

Ok. Will make it consistent to say that it returns one of the error codes.

2009-10-21 18:45:38

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call



Daniel Lezcano wrote:
> Oren Laadan wrote:
>>
>> Daniel Lezcano wrote:
> [ ... ]
>
>>> I forgot to mention a constraint with the specified pid : P2 has to
>>> be child of P1.
>>> In other word, you can not specify a pid to clonat which is not your
>>> descendant (including yourself).
>>> With this constraint I think there is no security issues.
>>
>> Sounds dangerous. What if your descendant executed a setuid program ?
>
> That does not happen because you inherit the context of the caller.
>
>>> Concerning of forking on behalf of another process, we can consider
>>> it is up to the caller / programmer to know what it does. If a
>>> process in
>>
>> Before the user can program with this syscall, _you_ need to define
>> the semantics of this syscall.
> Yes, you are right. Here it is the proposition of the semantics.
>
> Function prototype is:
>
> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
>
> Structure types are:
>
> typedef int clone_flag_t;
>
> struct clone_args {
> clone_flag_t *flags;
> int flags_size;
> u32 reserved1;
> u32 reserved2;
> u64 child_stack_base;
> u64 child_stack_size;
> u64 parent_tid_ptr;
> u64 child_tid_ptr;
> u64 reserved3;
> };
>
> With the helper macros:
>
> void CLONE_SET(int flag, clone_flag_t *flags);
> void CLONE_CLR(int flag, clone_flag_t *flags);
> bool CLONE_ISSET(int flag, clone_flag_t *flags);
> void CLONE_ZERO(flag_t *clone_flags);
>
> And:
>
> #define CLONEXT_VM 0x20 /* CLONE_VM>>3 */ #define CLONEXT_FS
> 0x21
> #define CLONEXT_FILES 0x22
> ...
>

The main motivation for your new syscall is to make it possible to
inject a process into a namespace. IOW, what you are proposing is
a new incarnation of sys_hijack().

This is _orthogonal_ to the current discussion, which is about an
extension for clone to allow (a) choosing target pid(s), (b) more
flags, and (c) future extensions.

(Your suggested syscall may, too, allow the request a specific set
of pids for the child process, and reuse the current code for that).

I suggest that you start a new thread about your RFC. This will
reduce distractions on the current thread, and bring more focus to
your proposal. I surely will post some comments there :)

[...]

> The cloneat syscall can be used for the following use cases:
>
> * checkpoint / restart:
>
> The restart can be done with a clone(.., CLONE_NEWPID|...);
> Then the new pid (aka pid 1) retrieves the proctree from the statefile
> and creates the different tasks with the process hierarchy with the
> cloneat syscall.

s/cloneat/$CLONE3/
(hint: this is how it's done now)

>
> The proctree creation can be done from outside of the pid namespace or
> from inside.

Ew .. why would you do that ?

> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
> restart them. The checkpoint of a nested pid namespace should be
> forbidden except for the leaf of a pid namespaces tree. That should

Others (me included) *will* try and may get upset if forbidden...
Seriously, there is no technical reason to restrict this.

>> Can you define more precisely what you mean by "enter" the container ?
>>
>> If you simply want create a new process in the container, you can
>> achieve the same thing with a daemon, or a smart init process (in
>> there), or even ptrace tricks.
>
> Yes, you can launch a daemon inside the container, that works for a
> system container because the container is killed by killing the first
> process of the container or by a shutdown inside the container (not
> fully implemented in the kernel).
> But this is unreliable for application containers, I won't enter in the
> details but the container exits when the application exits, with a
> daemon inside the container, this is no longer the case because you can
> not detect the application death as the daemon is always there.
>
> With cloneat you restrict the life cycle of the command you launched,
> that is the container exits as soon as all the processes exited the
> container, including the spawned command itself.

Then start a daemon _in addition_ to the application, or write a
daemon that will launch the application and monitor it... And also
there is ptrace -

But, please let's take this off to a new thread about adding how to
add a process into a namespace from the outside. FYI, I do think
such an interface may be useful and nicer than the two alternatives
I suggested above.

>> Also, there is a reason why sys_hijack() was hijacked away ... And
>> I honestly think that a syscall to force another process to clone
>> would be shot down by the kernel guys.
> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.

Actually, I misread previously; I mean not forcing another process
to clone, but instead forcing another process to become a parent (and
I shall ignore the ethical issues :)

I still suspect it won't be welcome. Several people would have liked
to see CLONE_PARENT go away, too, if that was possible without breaking
userspace applications. Yet another reason to take it to a discussion
of its own.

Oren.

2009-10-21 18:51:29

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| Not what I was thinking. The following untested patch is what I was
| thinking. It just exposes last, min, and max to the callers which pass
| in different appropriate values.


Minor comments are that the caller has to choose which function to call
and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
unnatural. But I can't think of a better way and I think this will work.
Will test it out.

Pavel Emelyanov - you too had some comments about this part of my code.
Let me know if this works for you.

Sukadev

2009-10-21 19:43:35

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

H. Peter Anvin [[email protected]] wrote:
> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>
>> My question here is: what does "3" actually mean? In general, system
>> calls have not followed any convention of numbering to indicate
>> successive versions -- clone2() being the one possible exception that
>> I know of.
>>
>
> "3" is number of arguments.

To me, it is a version number.

mmap() and mmap2() both have 6 parameters.

Besides if wait4() were born before wait3(), would it still be wait4() :-)
But I see that it is hard to get one-convention-that-fits-all.

> It's better than "extended" or something
> like that simply because "extended" just means "more than", and a number
> at least tells you *how much more than*.

And extended assumes we wont extend again.

An informal poll of reviewers has clone3() with a slight advantage :-)

clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.

I like clone3() but am not insisting on it. I just want a name...

Sukadev

2009-10-21 21:11:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu <[email protected]> writes:

> Eric W. Biederman [[email protected]] wrote:
> | Not what I was thinking. The following untested patch is what I was
> | thinking. It just exposes last, min, and max to the callers which pass
> | in different appropriate values.
>
>
> Minor comments are that the caller has to choose which function to call
> and passing in 'target-1' for the 'last' in set_pidmap() seems a bit
> unnatural. But I can't think of a better way and I think this will work.
> Will test it out.

Thanks. My primary concern was that we don't unnecessarily duplicate all
of the weird allocating a bitmap page code etc.

2009-10-21 22:12:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On 10/22/2009 04:44 AM, Sukadev Bhattiprolu wrote:
>>
>> "3" is number of arguments.
>
> To me, it is a version number.
>
> mmap() and mmap2() both have 6 parameters.
>

You keep bringing this up. mmap2() is (a) a non-user-visible call; (b)
an exception (a mistake, if you want.)

> Besides if wait4() were born before wait3(), would it still be wait4() :-)

Yes. wait3() came before wait4(), but there never was a wait2().

> But I see that it is hard to get one-convention-that-fits-all.
>
>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we wont extend again.

Exactly.

-hpa

2009-10-22 10:26:46

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Peter,

On Wed, Oct 21, 2009 at 3:03 PM, H. Peter Anvin <[email protected]> wrote:
> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>
>> My question here is: what does "3" actually mean? In general, system
>> calls have not followed any convention of numbering to indicate
>> successive versions -- clone2() being the one possible exception that
>> I know of.
>>
>
> "3" is number of arguments.

sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)

It appears to me that the number of arguments is 2.

> It's better than "extended" or something like
> that simply because "extended" just means "more than", and a number at least
> tells you *how much more than*.

I'm not sure why you think including a number in the name tells us
"how much more than". Unless you are considering the numbering to be
version numbers, which apparently is not what you mean.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2009-10-22 10:40:10

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Sukadev,

On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
<[email protected]> wrote:
> H. Peter Anvin [[email protected]] wrote:
>> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>>>
>>> My question here is: what does "3" actually mean? In general, system
>>> calls have not followed any convention of numbering to indicate
>>> successive versions -- clone2() being the one possible exception that
>>> I know of.
>>>
>>
>> "3" is number of arguments.
>
> To me, it is a version number.

See my precending mail. Isn't the number of arguments "2".

> mmap() and mmap2() both have 6 parameters.
>
> Besides if wait4() were born before wait3(), would it still be wait4() :-)
> But I see that it is hard to get one-convention-that-fits-all.

Yes -- that's exactly right.

>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we wont extend again.

Well, if we do things right in this design, we may not need to ever
extend (by creating a new syscall) again. That's why I mentioned the
"flags" argument idea. Did you give this some thought?

> An informal poll of reviewers has clone3() with a slight advantage :-)
>
> ? ? ? ?clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
> ? ? ? ?clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
>
> I like clone3() but am not insisting on it. I just want a name...

And I'm not really insisting on a change. As you rightly point out,
there is much inconsistency in the naming conventions that have been
used over the years.

But, because there has been no consistency in the use of numbers, and
because the number of arguments that are presented in a glibc
interface may differ from the number of arguments in an underlying
syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
inclined to think that clonex() or clone_ext() is slighly better than
clone3(). But, certainly, my arguments are not compelling.

Thanks,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2009-10-22 11:22:53

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
>> Oren Laadan wrote:
>>> Daniel Lezcano wrote:
>> [ ... ]
>>
>>>> I forgot to mention a constraint with the specified pid : P2 has to
>>>> be child of P1.
>>>> In other word, you can not specify a pid to clonat which is not your
>>>> descendant (including yourself).
>>>> With this constraint I think there is no security issues.
>>> Sounds dangerous. What if your descendant executed a setuid program ?
>> That does not happen because you inherit the context of the caller.
>>
>>>> Concerning of forking on behalf of another process, we can consider
>>>> it is up to the caller / programmer to know what it does. If a
>>>> process in
>>> Before the user can program with this syscall, _you_ need to define
>>> the semantics of this syscall.
>> Yes, you are right. Here it is the proposition of the semantics.
>>
>> Function prototype is:
>>
>> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
>>
>> Structure types are:
>>
>> typedef int clone_flag_t;
>>
>> struct clone_args {
>> clone_flag_t *flags;
>> int flags_size;
>> u32 reserved1;
>> u32 reserved2;
>> u64 child_stack_base;
>> u64 child_stack_size;
>> u64 parent_tid_ptr;
>> u64 child_tid_ptr;
>> u64 reserved3;
>> };
>>
>> With the helper macros:
>>
>> void CLONE_SET(int flag, clone_flag_t *flags);
>> void CLONE_CLR(int flag, clone_flag_t *flags);
>> bool CLONE_ISSET(int flag, clone_flag_t *flags);
>> void CLONE_ZERO(flag_t *clone_flags);
>>
>> And:
>>
>> #define CLONEXT_VM 0x20 /* CLONE_VM>>3 */ #define CLONEXT_FS
>> 0x21
>> #define CLONEXT_FILES 0x22
>> ...
>>
>
> The main motivation for your new syscall is to make it possible to
> inject a process into a namespace. IOW, what you are proposing is
> a new incarnation of sys_hijack().
>
> This is _orthogonal_ to the current discussion, which is about an
> extension for clone to allow (a) choosing target pid(s), (b) more
> flags, and (c) future extensions.
>
> (Your suggested syscall may, too, allow the request a specific set
> of pids for the child process, and reuse the current code for that).
>
> I suggest that you start a new thread about your RFC. This will
> reduce distractions on the current thread, and bring more focus to
> your proposal. I surely will post some comments there :)

I can argue exactly the same thing, the main motivation for your new
syscall is to make it possible to restart a process tree for a
checkpoint / restart and this is orthogonal with adding extended clone
flags :)

But my main motivation is to have the possibility to a) choose a target
__and__ b) clone the process relatively to another one. These 2 features
allows to do what *we* need, that is recreate a process tree and the
bonus with this approach is the ability to inject a process into a
namespace, something asked by several people, eg. debug with gdb an
application running into another pid namespace (is not supported today).

I am sorry for coming late in the discussion and for distracting.

> [...]
>
>> The cloneat syscall can be used for the following use cases:
>>
>> * checkpoint / restart:
>>
>> The restart can be done with a clone(.., CLONE_NEWPID|...);
>> Then the new pid (aka pid 1) retrieves the proctree from the statefile
>> and creates the different tasks with the process hierarchy with the
>> cloneat syscall.
>
> s/cloneat/$CLONE3/
> (hint: this is how it's done now)
Of course, what is described is what you does with 'clone3' !
Do you think I will come proposing a variant of 'clone3' not doing what
you need ? :)

>> The proctree creation can be done from outside of the pid namespace or
>> from inside.
>
> Ew .. why would you do that ?
And why not. Is there a semantic specifying how a process tree should be
recreated ?

>> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
>> restart them. The checkpoint of a nested pid namespace should be
>> forbidden except for the leaf of a pid namespaces tree. That should
>
> Others (me included) *will* try and may get upset if forbidden...
> Seriously, there is no technical reason to restrict this.

Ok.

> >> Can you define more precisely what you mean by "enter" the container ?
>>> If you simply want create a new process in the container, you can
>>> achieve the same thing with a daemon, or a smart init process (in
>>> there), or even ptrace tricks.
>> Yes, you can launch a daemon inside the container, that works for a
>> system container because the container is killed by killing the first
>> process of the container or by a shutdown inside the container (not
>> fully implemented in the kernel).
>> But this is unreliable for application containers, I won't enter in the
>> details but the container exits when the application exits, with a
>> daemon inside the container, this is no longer the case because you can
>> not detect the application death as the daemon is always there.
>>
>> With cloneat you restrict the life cycle of the command you launched,
>> that is the container exits as soon as all the processes exited the
>> container, including the spawned command itself.
>
> Then start a daemon _in addition_ to the application, or write a
> daemon that will launch the application and monitor it... And also
> there is ptrace -
Already tried :)

http://lxc.git.sourceforge.net/git/gitweb.cgi?p=lxc/lxc;a=blob;f=src/lxc/lxc_cinit.c;h=8f235483c1a9d9c9e0cc1ba69f1c33f1bc98b8aa;hb=57ff723f6a174a2a01c58c6ac367d118ef12b91c

> But, please let's take this off to a new thread about adding how to
> add a process into a namespace from the outside. FYI, I do think
> such an interface may be useful and nicer than the two alternatives
> I suggested above.
>
>>> Also, there is a reason why sys_hijack() was hijacked away ... And
>>> I honestly think that a syscall to force another process to clone
>>> would be shot down by the kernel guys.
>> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.
>
> Actually, I misread previously; I mean not forcing another process
> to clone, but instead forcing another process to become a parent (and
> I shall ignore the ethical issues :)
>
> I still suspect it won't be welcome. Several people would have liked
> to see CLONE_PARENT go away, too, if that was possible without breaking
> userspace applications. Yet another reason to take it to a discussion
> of its own.

At this point, I am hesitating of creating a new thread for this
discussion. Because, there will be:
* clone
* clone2
* clone3

and we will discuss again about a new clone syscall with a different API :(

I will not continue arguing on this thread except if someone is in favor
of cloneat.
Otherwise, I will spawn a new thread later.

Thanks
-- Daniel

2009-10-22 11:46:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
>>
>> "3" is number of arguments.
>
> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
>
> It appears to me that the number of arguments is 2.
>

It was 3 at one point... I'm not sure when that changed last :-/

>> It's better than "extended" or something like
>> that simply because "extended" just means "more than", and a number at least
>> tells you *how much more than*.
>
> I'm not sure why you think including a number in the name tells us
> "how much more than". Unless you are considering the numbering to be
> version numbers, which apparently is not what you mean.

It is a version number of sorts.

-hpa

2009-10-22 12:14:16

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Peter,

On Thu, Oct 22, 2009 at 1:38 PM, H. Peter Anvin <[email protected]> wrote:
> On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
>>>
>>> "3" is number of arguments.
>>
>> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
>>
>> It appears to me that the number of arguments is 2.
>>
>
> It was 3 at one point... I'm not sure when that changed last :-/
>
>>> It's better than "extended" or something like
>>> that simply because "extended" just means "more than", and a number at
>>> least
>>> tells you *how much more than*.
>>
>> I'm not sure why you think including a number in the name tells us
>> "how much more than". Unless you are considering the numbering to be
>> version numbers, which apparently is not what you mean.
>
> It is a version number of sorts.

So, sometimes, a number in a system call should be the bit width of
some arguments(s), sometimes it should be the number of arguments, and
sometimes (well, just occasionally, as in mmap2() and clone()) -- it
should be a version number? Does the weather play any part in the
decision? ;-)

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface" http://blog.man7.org/

2009-10-22 12:26:31

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On 10/22/2009 09:14 PM, Michael Kerrisk wrote:
>
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the
> decision? ;-)
>

The notion is that they are *some* kind of description on how the system
call has been augmented. The bitwidths and argument numbers are
non-overlapping and visually very different, so are not subject to
confusion. Your argument makes about as much sense as saying the letter
"a" should have the same meaning in every context.

-hpa

2009-10-22 13:57:26

by Matt Helsley

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

On Thu, Oct 22, 2009 at 02:14:16PM +0200, Michael Kerrisk wrote:
> Peter,
>
> On Thu, Oct 22, 2009 at 1:38 PM, H. Peter Anvin <[email protected]> wrote:
> > On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
> >>>
> >>> "3" is number of arguments.
> >>
> >> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
> >>
> >> It appears to me that the number of arguments is 2.
> >>
> >
> > It was 3 at one point... I'm not sure when that changed last :-/
> >
> >>> It's better than "extended" or something like
> >>> that simply because "extended" just means "more than", and a number at
> >>> least
> >>> tells you *how much more than*.
> >>
> >> I'm not sure why you think including a number in the name tells us
> >> "how much more than". Unless you are considering the numbering to be
> >> version numbers, which apparently is not what you mean.
> >
> > It is a version number of sorts.
>
> So, sometimes, a number in a system call should be the bit width of
> some arguments(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the

In this case 3 could be both the number of arguments and the version
number (clone, clone2, clone3). Match 2 conventions with one choice of
name. :)

Cheers,
-Matt Helsley

2009-10-22 18:09:30

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

Michael Kerrisk [[email protected]] wrote:
| Sukadev,
|
| On Wed, Oct 21, 2009 at 9:44 PM, Sukadev Bhattiprolu
| <[email protected]> wrote:
| > H. Peter Anvin [[email protected]] wrote:
| >> On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
| >>>
| >>> My question here is: what does "3" actually mean? In general, system
| >>> calls have not followed any convention of numbering to indicate
| >>> successive versions -- clone2() being the one possible exception that
| >>> I know of.
| >>>
| >>
| >> "3" is number of arguments.
| >
| > To me, it is a version number.
|
| See my precending mail. Isn't the number of arguments "2".

Well it was 2 at one point, but I have posted a new version of
just that one patch - please see http://lkml.org/lkml/2009/10/16/3
for comments.

I am working on some updates and will post a new patchset - it
will have 3 parameters to clone3() as shown in the above mail.

|
| > mmap() and mmap2() both have 6 parameters.
| >
| > Besides if wait4() were born before wait3(), would it still be wait4() :-)
| > But I see that it is hard to get one-convention-that-fits-all.
|
| Yes -- that's exactly right.
|
| >> It's better than "extended" or something
| >> like that simply because "extended" just means "more than", and a number
| >> at least tells you *how much more than*.
| >
| > And extended assumes we wont extend again.
|
| Well, if we do things right in this design, we may not need to ever
| extend (by creating a new syscall) again. That's why I mentioned the
| "flags" argument idea. Did you give this some thought?

Yes, we have done the best we can to avoid extending clone() again
anytime soon (some reserved bytes and clone_args_size field). Would
we still need the flags parameter ? Again its in the new patch
that I pointed to above.

|
| > An informal poll of reviewers has clone3() with a slight advantage :-)
| >
| > ? ? ? ?clone_extended() camp: Serge Hallyn, Kerrisk, Louis Rilling,
| > ? ? ? ?clone3(): Sukadev, H. Peter Anvin, Oren, Matt Helsley.
| >
| > I like clone3() but am not insisting on it. I just want a name...
|
| And I'm not really insisting on a change. As you rightly point out,
| there is much inconsistency in the naming conventions that have been
| used over the years.
|
| But, because there has been no consistency in the use of numbers, and
| because the number of arguments that are presented in a glibc
| interface may differ from the number of arguments in an underlying
| syscall (several precedents: signalfd4(), pselect(), ppoll()), I'm
| inclined to think that clonex() or clone_ext() is slighly better than
| clone3(). But, certainly, my arguments are not compelling.

2009-10-23 00:41:33

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| +{
| + if (target >= pid_max)
| + return -1;

I am changing this and the next return to 'return -EINVAL', to match
an earlier patch in my patchset.

| + if (target < RESERVED_PIDS)

Should we replace RESERVED_PIDS with 0 ? We currently allow new
containers to have pids 1..32K in the first pass and in subsequent
passes assign starting at RESERVED_PIDS.

Sukadev

2009-10-23 01:03:53

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu <[email protected]> writes:

> Eric W. Biederman [[email protected]] wrote:
> | +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> | +{
> | + if (target >= pid_max)
> | + return -1;
>
> I am changing this and the next return to 'return -EINVAL', to match
> an earlier patch in my patchset.
>
> | + if (target < RESERVED_PIDS)
>
> Should we replace RESERVED_PIDS with 0 ? We currently allow new
> containers to have pids 1..32K in the first pass and in subsequent
> passes assign starting at RESERVED_PIDS.

If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
check removes most if not all of the point of RESERVED_PIDS.

In a new fresh pid namespace I have no problem with not performing
the RESERVED_PIDS check.

So I guess that makes the check.

if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
return -EINVAL;

Eric

2009-10-23 05:28:37

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| > | + if (target < RESERVED_PIDS)
| >
| > Should we replace RESERVED_PIDS with 0 ? We currently allow new
| > containers to have pids 1..32K in the first pass and in subsequent
| > passes assign starting at RESERVED_PIDS.
|
| If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
| check removes most if not all of the point of RESERVED_PIDS.
|
| In a new fresh pid namespace I have no problem with not performing
| the RESERVED_PIDS check.

In that case can we do this

if (target_pid < RESERVED_PIDS && !pid_ns->level)
return -EINVAL;

instead ?
|
| So I guess that makes the check.
|
| if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
| return -EINVAL;

I am just wondering if there is a small corner case where C/R would randomly
fail because of this sequence:

- C/R code calls clone() or clone3() say about RESERVED_PIDS-1
times and ->last_pid == RESERVED_PIDS-1.

- C/R code calls normal fork()/alloc_pidmap() for a short-lived
child - its pid == ->last_pid == RESERVED_PIDS

- C/R code then calls clone3()/set_pidmap() to set the pid of
a new child to RESERVED_PID but fails (i.e it fails to restore
a pid even when the pid is not in use).

We could argue that mixing alloc_pidmap() and set_pidmap() during restart
is bad since set_pidmap() may fail.

The C/R developer could argue that we are forcing them to specify a pid
even for a short lived process that they wait()s on and thus ensure that
pid is not in use.

Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
would it be ok enforce it only in init_pid_ns ?

Sukadev

2009-10-23 05:44:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu <[email protected]> writes:

> Eric W. Biederman [[email protected]] wrote:
> | > | + if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> |
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> |
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
>
> In that case can we do this
>
> if (target_pid < RESERVED_PIDS && !pid_ns->level)
> return -EINVAL;
>
> instead ?
> |
> | So I guess that makes the check.
> |
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> | return -EINVAL;
>
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
>
> - C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> times and ->last_pid == RESERVED_PIDS-1.
>
> - C/R code calls normal fork()/alloc_pidmap() for a short-lived
> child - its pid == ->last_pid == RESERVED_PIDS
>
> - C/R code then calls clone3()/set_pidmap() to set the pid of
> a new child to RESERVED_PID but fails (i.e it fails to restore
> a pid even when the pid is not in use).
>
> We could argue that mixing alloc_pidmap() and set_pidmap() during restart
> is bad since set_pidmap() may fail.
>
> The C/R developer could argue that we are forcing them to specify a pid
> even for a short lived process that they wait()s on and thus ensure that
> pid is not in use.
>
> Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> would it be ok enforce it only in init_pid_ns ?

It is mean for initial user space daemons, things that start on boot.

I don't know how much the protection matters at this date, but we have it.

Eric

2009-10-23 19:16:51

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call



Sukadev Bhattiprolu wrote:
> Eric W. Biederman [[email protected]] wrote:
> | > | + if (target < RESERVED_PIDS)
> | >
> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
> | > containers to have pids 1..32K in the first pass and in subsequent
> | > passes assign starting at RESERVED_PIDS.
> |
> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
> | check removes most if not all of the point of RESERVED_PIDS.
> |
> | In a new fresh pid namespace I have no problem with not performing
> | the RESERVED_PIDS check.
>
> In that case can we do this
>
> if (target_pid < RESERVED_PIDS && !pid_ns->level)
> return -EINVAL;
>
> instead ?
> |
> | So I guess that makes the check.
> |
> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
> | return -EINVAL;
>
> I am just wondering if there is a small corner case where C/R would randomly
> fail because of this sequence:
>
> - C/R code calls clone() or clone3() say about RESERVED_PIDS-1
> times and ->last_pid == RESERVED_PIDS-1.
>
> - C/R code calls normal fork()/alloc_pidmap() for a short-lived
> child - its pid == ->last_pid == RESERVED_PIDS
>
> - C/R code then calls clone3()/set_pidmap() to set the pid of
> a new child to RESERVED_PID but fails (i.e it fails to restore
> a pid even when the pid is not in use).

Not only for short-lived children. The problem is restart will succeed
or fail depending on the order in which tasks were checkpointed. If
task with pid 290 is restarted after pid 305, restart will fail.

And because chekcpoint scans the task tree in a DFS manner, this is
more likely to happen than not.

I wonder why you'd like to restrict a pid-specific clone like that ?
It is already a privileged syscall, so it could be exempt. I suggest
that only regular clones will be constrained.

Oren.

2009-10-23 19:20:05

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
| > would it be ok enforce it only in init_pid_ns ?
|
| It is mean for initial user space daemons, things that start on boot.
|
| I don't know how much the protection matters at this date, but we have it.

Well, since it is not security or other critical restriction, can we allow
set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
subtree C/R of one of the early daemons for debug for instance.

Sukadev

2009-10-23 19:34:10

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call



Oren Laadan wrote:
>
> Sukadev Bhattiprolu wrote:
>> Eric W. Biederman [[email protected]] wrote:
>> | > | + if (target < RESERVED_PIDS)
>> | >
>> | > Should we replace RESERVED_PIDS with 0 ? We currently allow new
>> | > containers to have pids 1..32K in the first pass and in subsequent
>> | > passes assign starting at RESERVED_PIDS.
>> |
>> | If it is a preexisting namespace pid namespace removing the RESERVED_PIDS
>> | check removes most if not all of the point of RESERVED_PIDS.
>> |
>> | In a new fresh pid namespace I have no problem with not performing
>> | the RESERVED_PIDS check.
>>
>> In that case can we do this
>>
>> if (target_pid < RESERVED_PIDS && !pid_ns->level)
>> return -EINVAL;
>>
>> instead ?
>> |
>> | So I guess that makes the check.
>> |
>> | if ((target < RESERVED_PIDS) && pid_ns->last_pid >= RESERVED_PIDS)
>> | return -EINVAL;
>>
>> I am just wondering if there is a small corner case where C/R would randomly
>> fail because of this sequence:
>>
>> - C/R code calls clone() or clone3() say about RESERVED_PIDS-1
>> times and ->last_pid == RESERVED_PIDS-1.
>>
>> - C/R code calls normal fork()/alloc_pidmap() for a short-lived
>> child - its pid == ->last_pid == RESERVED_PIDS
>>
>> - C/R code then calls clone3()/set_pidmap() to set the pid of
>> a new child to RESERVED_PID but fails (i.e it fails to restore
>> a pid even when the pid is not in use).
>
> Not only for short-lived children. The problem is restart will succeed
> or fail depending on the order in which tasks were checkpointed. If
> task with pid 290 is restarted after pid 305, restart will fail.
>
> And because chekcpoint scans the task tree in a DFS manner, this is
> more likely to happen than not.
>
> I wonder why you'd like to restrict a pid-specific clone like that ?
> It is already a privileged syscall, so it could be exempt. I suggest
> that only regular clones will be constrained.

I stand corrected by Suka: a pid-specific clone does not change
last_pid. Therefore, given that 'restart' only creates tasks with
pid-specific clone, this should be safe for c/r.

Oren.

2009-10-23 20:47:07

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu [[email protected]] wrote:
| Eric W. Biederman [[email protected]] wrote:
| | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
| | > would it be ok enforce it only in init_pid_ns ?
| |
| | It is mean for initial user space daemons, things that start on boot.
| |
| | I don't know how much the protection matters at this date, but we have it.
|
| Well, since it is not security or other critical restriction, can we allow
| set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
| subtree C/R of one of the early daemons for debug for instance.

So here is how I have it at present. I would like to remove the RESERVED_PIDS
check in set_pidmap() if its ok to do so.

alloc_pid() does this:

if (target_pids)
set_pidmap(tmp, target_pids[i])
else
alloc_pidmap(tmp);

Sukadev
---

>From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <suka@suka.(none)>
Date: Thu, 22 Oct 2009 16:57:28 -0700
Subject: [PATCH] Define set_pidmap() function

Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.

Changelog[v9]:
- Complete rewrite this patch based on Eric Biederman's code.
Changelog[v7]:
- [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
- (Eric Biederman): Avoid set_pidmap() function. Added couple of
checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
kernel/pid.c | 40 ++++++++++++++++++++++++++++++++--------
1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..9346755 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+ int max)
{
- int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int i, offset, max_scan, pid;
int rc = -EAGAIN;
struct pidmap *map;

pid = last + 1;
if (pid >= pid_max)
- pid = RESERVED_PIDS;
+ pid = min;
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
- max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
rc = alloc_pidmap_page(map);
if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
- pid_ns->last_pid = pid;
return pid;
}
offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
* bitmap block and the final block was the same
* as the starting point, pid is before last_pid.
*/
- } while (offset < BITS_PER_PAGE && pid < pid_max &&
+ } while (offset < BITS_PER_PAGE && pid < max &&
(i != max_scan || pid < last ||
!((last+1) & BITS_PER_PAGE_MASK)));
}
- if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+ if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
++map;
offset = 0;
} else {
map = &pid_ns->pidmap[0];
- offset = RESERVED_PIDS;
+ offset = min;
if (unlikely(last == offset)) {
rc = -EAGAIN;
break;
@@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
return rc;
}

+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+ int nr;
+
+ nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
+ if (nr >= 0)
+ pid_ns->last_pid = nr;
+ return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+ if (!target)
+ return alloc_pidmap(pid_ns);
+
+ if (target >= pid_max)
+ return -EINVAL;
+
+ if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
+ return -EINVAL;
+
+ return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
int next_pidmap(struct pid_namespace *pid_ns, int last)
{
int offset;
--
1.6.0.4

2009-10-23 23:12:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Oren Laadan <[email protected]> writes:


> I stand corrected by Suka: a pid-specific clone does not change
> last_pid. Therefore, given that 'restart' only creates tasks with
> pid-specific clone, this should be safe for c/r.

Exactly. We can preserve the existing guarantees and keep the ability
to restore any pid (at least in a fresh pid namespace).

Eric

2009-10-23 23:25:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu <[email protected]> writes:

> Sukadev Bhattiprolu [[email protected]] wrote:
> | Eric W. Biederman [[email protected]] wrote:
> | | > Anyway, is RESERVED_PIDS meant for initial kernel-threads/daemons - if so
> | | > would it be ok enforce it only in init_pid_ns ?
> | |
> | | It is mean for initial user space daemons, things that start on boot.
> | |
> | | I don't know how much the protection matters at this date, but we have it.
> |
> | Well, since it is not security or other critical restriction, can we allow
> | set_pidmap() a free hand - even in init-pid-ns ? It could prevent a simple
> | subtree C/R of one of the early daemons for debug for instance.
>
> So here is how I have it at present. I would like to remove the RESERVED_PIDS
> check in set_pidmap() if its ok to do so.
>
> alloc_pid() does this:
>
> if (target_pids)
> set_pidmap(tmp, target_pids[i])
> else
> alloc_pidmap(tmp);
>
> Sukadev
> ---
>
>>From bc6093fc4fc2f01070647df6f1e85e45edc89d27 Mon Sep 17 00:00:00 2001
> From: Sukadev Bhattiprolu <suka@suka.(none)>
> Date: Thu, 22 Oct 2009 16:57:28 -0700
> Subject: [PATCH] Define set_pidmap() function
>
> Define a set_pidmap() interface which is like alloc_pidmap() only that
> caller specifies the pid number to be assigned.
>
> Changelog[v9]:
> - Complete rewrite this patch based on Eric Biederman's code.
> Changelog[v7]:
> - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
> Changelog[v6]:
> - Separate target_pid > 0 case to minimize the number of checks needed.
> Changelog[v3]:
> - (Eric Biederman): Avoid set_pidmap() function. Added couple of
> checks for target_pid in alloc_pidmap() itself.
> Changelog[v2]:
> - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
> actually checks for 'pid <= 0' for completeness).
>
> Signed-off-by: Sukadev Bhattiprolu <[email protected]>
> ---
> kernel/pid.c | 40 ++++++++++++++++++++++++++++++++--------
> 1 files changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index c4d9914..9346755 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
> return 0;
> }
>
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
> + int max)
> {
> - int i, offset, max_scan, pid, last = pid_ns->last_pid;
> + int i, offset, max_scan, pid;
> int rc = -EAGAIN;
> struct pidmap *map;
>
> pid = last + 1;
> if (pid >= pid_max)
> - pid = RESERVED_PIDS;
> + pid = min;
> offset = pid & BITS_PER_PAGE_MASK;
> map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> - max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
> + max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
> for (i = 0; i <= max_scan; ++i) {
> rc = alloc_pidmap_page(map);
> if (rc)
> @@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> do {
> if (!test_and_set_bit(offset, map->page)) {
> atomic_dec(&map->nr_free);
> - pid_ns->last_pid = pid;
> return pid;
> }
> offset = find_next_offset(map, offset);
> @@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> * bitmap block and the final block was the same
> * as the starting point, pid is before last_pid.
> */
> - } while (offset < BITS_PER_PAGE && pid < pid_max &&
> + } while (offset < BITS_PER_PAGE && pid < max &&
> (i != max_scan || pid < last ||
> !((last+1) & BITS_PER_PAGE_MASK)));
> }
> - if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
> + if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
> ++map;
> offset = 0;
> } else {
> map = &pid_ns->pidmap[0];
> - offset = RESERVED_PIDS;
> + offset = min;
> if (unlikely(last == offset)) {
> rc = -EAGAIN;
> break;
> @@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> return rc;
> }
>
> +static int alloc_pidmap(struct pid_namespace *pid_ns)
> +{
> + int nr;
> +
> + nr = do_alloc_pidmap(pid_ns, pid_ns->last, RESERVED_PIDS, pid_max);
pid_ns->last_pid,

Looks like I missed that one.

> + if (nr >= 0)
> + pid_ns->last_pid = nr;
> + return nr;
> +}
> +
> +static int set_pidmap(struct pid_namespace *pid_ns, int target)
> +{
> + if (!target)
> + return alloc_pidmap(pid_ns);
> +
> + if (target >= pid_max)
> + return -EINVAL;
> +
> + if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
> + return -EINVAL;

if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))

Please.

Eric

> +
> + return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
> +}
> +
> int next_pidmap(struct pid_namespace *pid_ns, int last)
> {
> int offset;
> --
> 1.6.0.4

2009-10-24 03:37:20

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Eric W. Biederman [[email protected]] wrote:
| > +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| > +{
| > + if (!target)
| > + return alloc_pidmap(pid_ns);

BTW, we need this now that the RESERVED_PIDS check is is conditional on
->last_pid. But this makes set_pidmap() completely general so should we
have alloc_pid() call set_pidmap() always ? Or we could move this check
into alloc_pid(), but it may be better to have all values of 'target'
checked in one place.

| > +
| > + if (target >= pid_max)
| > + return -EINVAL;
| > +
| > + if ((target < 0) || (target < RESERVED_PIDS && pid_ns == &init_pid_ns))
| > + return -EINVAL;
|
| if ((target < 0) || ((target < RESERVED_PIDS) && (pid_ns->last_pid >= RESERVED_PIDS)))
|
| Please.

Ok.

2009-10-26 09:38:52

by Albert Cahalan

[permalink] [raw]
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

Sukadev Bhattiprolu writes:

> struct clone_struct {
> u64 flags;
> u64 child_stack;
> u32 nr_pids;
> u32 reserved1;
> u64 parent_tid;
> u64 child_tid;
> u64 reserved2;
> };
>
> sys_clone3(struct clone_struct __user *cs, pid_t __user *pids)

We'll be needing a clone4() for ia64 if you don't include the
stack_size parameter.

You have a prime opportunity to fix a portability problem that
has been quite a pain for some time. There are currently 3 cases
that userspace has to suffer with:

1. specify one end of the stack with clone
2. specify the other end of the stack with clone (HP PA-RISC)
3. specify the base and size with clone2 (IA-64)

Portable code ends up with an icky #ifdef mess.

BTW, the extra stack info is also useful for other purposes.
It's information that could appear in /proc/*/maps data.
With an extra flag, thread stacks could be freed on thread exit
so that other threads don't have to babysit an exiting thread.