From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 0/9] Implement clone_with_pids() system call
To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed. When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.
This patchset implements a new system call, clone_with_pids() that lets a
process specify the pids of the child process.
Patches 1 through 6 are helper patches needed for choosing a pid for the
child process.
PATCH 8 defines a prototype of the new system call. PATCH 9 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.
Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
code.
- [Oren Laadan] Make args_size a parameter to system call and remove
it from 'struct clone_args'
Changelog[v10]:
- [Linus Torvalds] Use PTREGSCALL() implementation for clone rather
than the generic system call
- Rename clone3() to clone_with_pids()
- Update Documentation/clone_with_pids() to show example usage with
the PTREGSCALL implementation.
Changelog[v9]:
- [Pavel Emelyanov] Drop the patch that made 'pid_max' a property
of struct pid_namespace
- [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To
avoid inadvertent truncation clone_flags, preserve the first
parameter of clone3() as 'u32 clone_flags' and specify newer
flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9)
- [Eric Biederman] Generalize alloc_pidmap() code to simplify and
remove duplication (see PATCH 3/9].
Changelog[v8]:
- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
The name 'clone2()' is in use - renamed new syscall to clone3().
- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
(Added [PATCH 7/10] to the patchset).
Changelog[v7]:
- [Peter Zijlstra, Arnd Bergmann]
Group the arguments to clone2() into a 'struct clone_arg' to
workaround the issue of exceeding 6 arguments to the system call.
Also define clone-flags as u64 to allow additional clone-flags.
Changelog[v6]:
- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
constant across architectures (Patches 7, 8).
- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
'unum_pids < 0' check (Patches 7,8)
- (Pavel Machek) New patch (Patch 9) to add some documentation.
Changelog[v5]:
- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
into this set)
- (Eric Biederman): Avoid the new function, set_pidmap() - added
couple of checks on 'target_pid' in alloc_pidmap() itself.
=== IMPORTANT NOTE:
clone() system call has another limitation - all but one bits in clone-flags
are in use and if more new clone-flags are needed, we will need a variant of
the clone() system call.
It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:
- do everything clone() does today, and
- give application an ability to choose pids for the child process
in all ancestor pid namespaces, and
- allow more clone_flags
Contstraints:
- system-calls are restricted to 6 parameters and clone() already
takes 5 parameters, any extension to clone() interface would require
one or more copy_from_user(). (Not sure if copy_from_user() of ~40
bytes would have a significant impact on performance of clone()).
Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset). Based on input from
Arnd Bergmann and others, the new interface of the system call is:
struct clone_args {
u64 clone_flags_high;
u64 child_stack_base;
u64 child_stack_size;
u64 parent_tid_ptr;
u64 child_tid_ptr;
u32 nr_pids;
u32 reserved0;
u64 reserved1;
};
sys_clone_with_pids(u32 flags_low, struct clone_args *cargs,
int args_size, pid_t *pids)
Details of the struct clone_args and the usage are explained in the
documentation (PATCH 9/9).
NOTE:
While this patchset enables support for more clone-flags, actual
implementation for additional clone-flags is best implemented as
a separate patchset (PATCH 8/9 identifies some TODOs)
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 1/9] Factor out code to allocate pidmap page
To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.
Changelog[v3]:
- Earlier version of patchset called alloc_pidmap_page() from two
places. But now its called from only one place. Even so, moving
this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
-ENOMEM on error instead of -1.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/pid.c | 45 ++++++++++++++++++++++++++++++---------------
1 files changed, 30 insertions(+), 15 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..7d4bb6e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
}
+static int alloc_pidmap_page(struct pidmap *map)
+{
+ void *page;
+
+ if (likely(map->page))
+ return 0;
+
+ page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ /*
+ * Free the page if someone raced with us installing it:
+ */
+ spin_lock_irq(&pidmap_lock);
+ if (map->page)
+ kfree(page);
+ else
+ map->page = page;
+ spin_unlock_irq(&pidmap_lock);
+
+ if (unlikely(!map->page))
+ return -ENOMEM;
+
+ return 0;
+}
+
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int rc;
struct pidmap *map;
pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
- if (unlikely(!map->page)) {
- void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
- /*
- * Free the page if someone raced with us
- * installing it:
- */
- spin_lock_irq(&pidmap_lock);
- if (map->page)
- kfree(page);
- else
- map->page = page;
- spin_unlock_irq(&pidmap_lock);
- if (unlikely(!map->page))
- break;
- }
+ rc = alloc_pidmap_page(map);
+ if (rc)
+ break;
+
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 2/9] Have alloc_pidmap() return actual error code
alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed. With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.
Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/fork.c | 5 +++--
kernel/pid.c | 14 +++++++++-----
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 4c20fff..93626b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1156,10 +1156,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
- if (!pid)
+ if (IS_ERR(pid)) {
+ retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+ }
if (clone_flags & CLONE_NEWPID) {
retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 7d4bb6e..c4d9914 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidmap *map)
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
- int rc;
+ int rc = -EAGAIN;
struct pidmap *map;
pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
} else {
map = &pid_ns->pidmap[0];
offset = RESERVED_PIDS;
- if (unlikely(last == offset))
+ if (unlikely(last == offset)) {
+ rc = -EAGAIN;
break;
+ }
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return rc;
}
int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct upid *upid;
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
- if (!pid)
+ if (!pid) {
+ pid = ERR_PTR(-ENOMEM);
goto out;
+ }
tmp = ns;
for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
free_pidmap(pid->numbers + i);
kmem_cache_free(ns->pid_cachep, pid);
- pid = NULL;
+ pid = ERR_PTR(nr);
goto out;
}
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 3/9] Define set_pidmap() function
Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.
Changelog[v9]:
- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
- [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
- (Eric Biederman): Avoid set_pidmap() function. Added couple of
checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
actually checks for 'pid <= 0' for completeness).
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Oren Laadan <[email protected]>
---
kernel/pid.c | 41 +++++++++++++++++++++++++++++++++--------
1 files changed, 33 insertions(+), 8 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..c849327 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
}
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+ int max)
{
- int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int i, offset, max_scan, pid;
int rc = -EAGAIN;
struct pidmap *map;
pid = last + 1;
if (pid >= pid_max)
- pid = RESERVED_PIDS;
+ pid = min;
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
- max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
rc = alloc_pidmap_page(map);
if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
- pid_ns->last_pid = pid;
return pid;
}
offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
* bitmap block and the final block was the same
* as the starting point, pid is before last_pid.
*/
- } while (offset < BITS_PER_PAGE && pid < pid_max &&
+ } while (offset < BITS_PER_PAGE && pid < max &&
(i != max_scan || pid < last ||
!((last+1) & BITS_PER_PAGE_MASK)));
}
- if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+ if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
++map;
offset = 0;
} else {
map = &pid_ns->pidmap[0];
- offset = RESERVED_PIDS;
+ offset = min;
if (unlikely(last == offset)) {
rc = -EAGAIN;
break;
@@ -199,6 +199,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
return rc;
}
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+ int nr;
+
+ nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
+ if (nr >= 0)
+ pid_ns->last_pid = nr;
+ return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+ if (!target)
+ return alloc_pidmap(pid_ns);
+
+ if (target >= pid_max)
+ return -EINVAL;
+
+ if ((target < 0) || (target < RESERVED_PIDS &&
+ pid_ns->last_pid >= RESERVED_PIDS))
+ return -EINVAL;
+
+ return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
int next_pidmap(struct pid_namespace *pid_ns, int last)
{
int offset;
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 4/9] Add target_pids parameter to alloc_pid()
This parameter is currently NULL, but will be used in a follow-on patch.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
include/linux/pid.h | 2 +-
kernel/fork.c | 3 ++-
kernel/pid.c | 9 +++++++--
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
int next_pidmap(struct pid_namespace *pid_ns, int last);
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
extern void free_pid(struct pid *pid);
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 93626b2..3f1dddf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,6 +980,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+ pid_t *target_pids = NULL;
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1156,7 +1157,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns);
+ pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index c849327..7c73096 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -281,13 +281,14 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
{
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+ pid_t tpid;
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid) {
@@ -297,7 +298,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ tpid = 0;
+ if (target_pids)
+ tpid = target_pids[i];
+
+ nr = set_pidmap(tmp, tpid);
if (nr < 0)
goto out_free;
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 5/9] Add target_pids parameter to copy_process()
Add a 'target_pids' parameter to copy_process(). The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/fork.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f1dddf..c8a06de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,12 +975,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
+ pid_t *target_pids,
int trace)
{
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
- pid_t *target_pids = NULL;
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1361,7 +1361,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
struct pt_regs regs;
task = copy_process(CLONE_VM, 0, idle_regs(®s), 0, NULL,
- &init_struct_pid, 0);
+ &init_struct_pid, NULL, 0);
if (!IS_ERR(task))
init_idle(task, cpu);
@@ -1384,6 +1384,7 @@ long do_fork(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
+ pid_t *target_pids = NULL;
/*
* Do some preliminary argument and permissions checking before we
@@ -1424,7 +1425,7 @@ long do_fork(unsigned long clone_flags,
trace = tracehook_prepare_clone(clone_flags);
p = copy_process(clone_flags, stack_start, regs, stack_size,
- child_tidptr, NULL, trace);
+ child_tidptr, NULL, target_pids, trace);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 6/9] Check invalid clone flags
As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future. To ensure this, define
a mask of clone-flags and check the flags in the clone() system calls.
Changelog[v9]:
- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
to avoid breaking any applications that may have set it. IOW, this
patch/check only applies to clone-flags bits 33 and higher.
Changelog[v8]:
- New patch in set
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Oren Laadan <[email protected]>
---
include/linux/sched.h | 12 ++++++++++++
kernel/fork.c | 3 +++
2 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..6b319a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */
+#define CLONE_UNUSED 0x00001000 /* Can be reused ? */
+
+#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+ CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+ CLONE_VFORK | CLONE_PARENT | CLONE_THREAD |\
+ CLONE_NEWNS | CLONE_SYSVSEM | CLONE_SETTLS |\
+ CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\
+ CLONE_DETACHED | CLONE_UNTRACED |\
+ CLONE_CHILD_SETTID | CLONE_STOPPED |\
+ CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+ CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
+
/*
* Scheduling policies
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index c8a06de..11f77ed 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
struct task_struct *p;
int cgroup_callbacks_done = 0;
+ if (clone_flags & ~VALID_CLONE_FLAGS)
+ return ERR_PTR(-EINVAL);
+
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 7/9] Define do_fork_with_pids()
do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.
Changelog[v7]:
- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
instead of 'struct pid_set *'.
Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
is constant across architectures.
- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.
Changelog[v4]:
- Rename 'struct target_pid_set' to 'struct pid_set' since it may
be useful in other contexts.
Changelog[v3]:
- Fix "long-line" warning from checkpatch.pl
Changelog[v2]:
- To facilitate moving architecture-inpdendent code to kernel/fork.c
pass in 'struct target_pid_set __user *' to do_fork_with_pids()
rather than 'pid_t *' (next patch moves the arch-independent
code to kernel/fork.c)
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/fork.c | 17 +++++++++++++++--
2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6b319a0..265efe5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2153,6 +2153,9 @@ extern int disallow_signal(int);
extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+ unsigned long, int __user *, int __user *,
+ unsigned int, pid_t __user *);
struct task_struct *fork_idle(int);
extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11f77ed..210e841 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1377,12 +1377,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*/
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ unsigned int num_pids,
+ pid_t __user *upids)
{
struct task_struct *p;
int trace = 0;
@@ -1485,6 +1487,17 @@ long do_fork(unsigned long clone_flags,
return nr;
}
+long do_fork(unsigned long clone_flags,
+ unsigned long stack_start,
+ struct pt_regs *regs,
+ unsigned long stack_size,
+ int __user *parent_tidptr,
+ int __user *child_tidptr)
+{
+ return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+ parent_tidptr, child_tidptr, 0, NULL);
+}
+
#ifndef ARCH_MIN_MMSTRUCT_ALIGN
#define ARCH_MIN_MMSTRUCT_ALIGN 0
#endif
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 8/9] Define clone_with_pids() syscall
Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.
clone_with_pids(), intended for use during restart, is the same as
clone(), except that it takes a 'pids' paramter. This parameter lets
caller choose specific pid numbers for the child process, in the
process's active and ancestor pid namespaces. (Descendant pid namespaces
in general don't matter since processes don't have pids in them anyway,
but see comments in copy_target_pids() regarding CLONE_NEWPID).
clone_with_pids() also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and all but
one of these are in use. If more new clone flags are needed, we will be
forced to define a new variant of the clone() system call. To address
this, clone_with_pids() allows at least 64 clone flags with some room
for more if necessary.
To prevent unprivileged processes from misusing this interface,
clone_with_pids() currently needs CAP_SYS_ADMIN, when the 'pids' parameter
is non-NULL.
See Documentation/clone_with_pids in next patch for more details and an
example of its usage.
NOTE:
- System calls are restricted to 6 parameters and the number and sizes
of parameters needed for clone_with_pids() exceed 6 integers. The new
prototype works around this restriction while providing some
flexibility if clone_with_pids() needs to be further extended in the
future.
TODO:
- We should convert clone-flags to 64-bit value in all architectures.
Its probably best to do that as a separate patchset since clone_flags
touches several functions and that patchset seems independent of this
new system call.
Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
code.
- [Oren Laadan] Make args_size a parameter to system call and remove
it from 'struct clone_args'
Changelog[v10]:
- Rename clone3() to clone_with_pids()
- [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
implementation
Changelog[v9]:
- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
architectures split the new clone-flags into 'low' and 'high'
words and pass in the 'lower' flags as the first argument.
This would maintain similarity of the clone3() with clone()/
clone2(). Also has the side-effect of the name matching the
number of parameters :-)
- [Roland McGrath] Rename structure to 'clone_args' and add a
'child_stack_size' field
Changelog[v8]
- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
must be 64-bit.
- clone2() is in use in IA64. Rename system call to clone3().
Changelog[v7]:
- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
and group parameters into a new 'struct clone_struct' object.
Changelog[v6]:
- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
is constant across architectures.
- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
'unum_pids < 0' check.
Changelog[v4]:
- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
Changelog[v3]:
- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
in the target_pids[] list and setting it 0. See copy_target_pids()).
- (Oren Laadan) Specified target pids should apply only to youngest
pid-namespaces (see copy_target_pids())
- (Matt Helsley) Update patch description.
Changelog[v2]:
- Remove unnecessary printk and add a note to callers of
copy_target_pids() to free target_pids.
- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
'num_pids == 0' (fall back to normal clone()).
- Move arch-independent code (sanity checks and copy-in of target-pids)
into kernel/fork.c and simplify sys_clone_with_pids()
Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my
git tree but had not refreshed patches before emailing them)
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
arch/x86/include/asm/syscalls.h | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/kernel/entry_32.S | 1 +
arch/x86/kernel/process_32.c | 49 ++++++++++++++
arch/x86/kernel/syscall_table_32.S | 1 +
include/linux/sched.h | 2 +
include/linux/types.h | 11 +++
kernel/fork.c | 124 +++++++++++++++++++++++++++++++++++-
8 files changed, 189 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..3205f55 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -40,6 +40,7 @@ long sys_iopl(struct pt_regs *);
/* kernel/process_32.c */
int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
int sys_execve(struct pt_regs *);
/* kernel/signal.c */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..a427571 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_clone_with_pids 337
#ifdef __KERNEL__
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..c7bd1f6 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -718,6 +718,7 @@ ptregs_##name: \
PTREGSCALL(iopl)
PTREGSCALL(fork)
PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
PTREGSCALL(vfork)
PTREGSCALL(execve)
PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4cf7956..55dd051 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,55 @@ int sys_clone(struct pt_regs *regs)
return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
}
+int sys_clone_with_pids(struct pt_regs *regs)
+{
+ int rc;
+ int args_size;
+ struct clone_args kca;
+ unsigned long flags;
+ int __user *parent_tid_ptr;
+ int __user *child_tid_ptr;
+ unsigned long __user child_stack;
+ unsigned long stack_size;
+ unsigned int flags_low;
+ struct clone_args __user *uca;
+ pid_t __user *pids;
+
+ flags_low = regs->bx;
+ uca = (struct clone_args __user *)regs->cx;
+ args_size = regs->dx;
+ pids = (pid_t __user *)regs->di;
+
+ rc = fetch_clone_args_from_user(uca, args_size, &kca);
+ if (rc)
+ return rc;
+
+ /*
+ * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+ * TODO: When ->clone_flags_high is non-zero, copy it in to the
+ * higher word(s) of 'flags':
+ *
+ * flags = (kca.clone_flags_high << 32) | flags_low;
+ */
+ flags = flags_low;
+ parent_tid_ptr = (int *)kca.parent_tid_ptr;
+ child_tid_ptr = (int *)kca.child_tid_ptr;
+
+ stack_size = (unsigned long)kca.child_stack_size;
+ child_stack = (unsigned long)kca.child_stack_base + stack_size;
+
+ if (!child_stack)
+ child_stack = regs->sp;
+
+ /*
+ * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+ * to several functions. Need to convert clone_flags to 64-bit.
+ */
+ return do_fork_with_pids(flags, child_stack, regs, stack_size,
+ parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+ pids);
+}
+
/*
* sys_execve() executes a new program.
*/
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..1357e83 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
+ .long ptregs_clone_with_pids
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 265efe5..0095ba0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2153,6 +2153,8 @@ extern int disallow_signal(int);
extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern int fetch_clone_args_from_user(struct clone_args __user *, int,
+ struct clone_args *);
extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
unsigned long, int __user *, int __user *,
unsigned int, pid_t __user *);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..0d08683 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,17 @@ struct ustat {
char f_fpack[6];
};
+struct clone_args {
+ u64 clone_flags_high;
+ u64 child_stack_base;
+ u64 child_stack_size;
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+ u32 nr_pids;
+ u32 reserved0;
+ u64 reserved1;
+};
+
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
#endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 210e841..da28baf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1372,6 +1372,114 @@ struct task_struct * __cpuinit fork_idle(int cpu)
}
/*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+ int j;
+ int rc;
+ int size;
+ int knum_pids; /* # of pids needed in kernel */
+ pid_t *target_pids;
+
+ if (!unum_pids)
+ return NULL;
+
+ knum_pids = task_pid(current)->level + 1;
+ if (unum_pids > knum_pids)
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+ * and set it to 0. This last entry in target_pids[] corresponds to the
+ * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+ * specified. If CLONE_NEWPID was not specified, this last entry will
+ * simply be ignored.
+ */
+ target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+ if (!target_pids)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * A process running in a level 2 pid namespace has three pid namespaces
+ * and hence three pid numbers. If this process is checkpointed,
+ * information about these three namespaces are saved. We refer to these
+ * namespaces as 'known namespaces'.
+ *
+ * If this checkpointed process is however restarted in a level 3 pid
+ * namespace, the restarted process has an extra ancestor pid namespace
+ * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+ *
+ * During restart, the process requests specific pids for its 'known
+ * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+ *
+ * Since the requested-pids correspond to 'known namespaces' and since
+ * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+ * namespaces', copy requested pids to the back-end of target_pids[]
+ * (i.e before the last entry for CLONE_NEWPID mentioned above).
+ * Any entries in target_pids[] not corresponding to a requested pid
+ * will be set to zero and kernel assigns a pid in those namespaces.
+ *
+ * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+ * youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+ * the order is:
+ *
+ * - pids for 'unknown-namespaces' (if any)
+ * - pids for 'known-namespaces' (requested pids)
+ * - 0 in the last entry (for CLONE_NEWPID).
+ */
+ j = knum_pids - unum_pids;
+ size = unum_pids * sizeof(pid_t);
+
+ rc = copy_from_user(&target_pids[j], upids, size);
+ if (rc) {
+ rc = -EFAULT;
+ goto out_free;
+ }
+
+ return target_pids;
+
+out_free:
+ kfree(target_pids);
+ return ERR_PTR(rc);
+}
+
+int
+fetch_clone_args_from_user(struct clone_args __user *uca, int args_size,
+ struct clone_args *kca)
+{
+ int rc;
+
+ /*
+ * TODO: If size of clone_args is not what the kernel expects, it
+ * could be that kernel is newer and has an extended structure.
+ * When that happens, this check needs to be smarter. For now,
+ * assume exact match.
+ */
+ if (args_size != sizeof(struct clone_args))
+ return -EINVAL;
+
+ rc = copy_from_user(kca, uca, args_size);
+ if (rc)
+ return -EFAULT;
+
+ /*
+ * To avoid future compatibility issues, ensure unused fields are 0.
+ */
+ if (kca->reserved0 || kca->reserved1 || kca->clone_flags_high)
+ return -EINVAL;
+
+ return 0;
+}
+
+/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
@@ -1389,7 +1497,7 @@ long do_fork_with_pids(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
- pid_t *target_pids = NULL;
+ pid_t *target_pids;
/*
* Do some preliminary argument and permissions checking before we
@@ -1423,6 +1531,16 @@ long do_fork_with_pids(unsigned long clone_flags,
}
}
+ target_pids = copy_target_pids(num_pids, upids);
+ if (target_pids) {
+ if (IS_ERR(target_pids))
+ return PTR_ERR(target_pids);
+
+ nr = -EPERM;
+ if (!capable(CAP_SYS_ADMIN))
+ goto out_free;
+ }
+
/*
* When called from kernel_thread, don't do user tracing stuff.
*/
@@ -1484,6 +1602,10 @@ long do_fork_with_pids(unsigned long clone_flags,
} else {
nr = PTR_ERR(p);
}
+
+out_free:
+ kfree(target_pids);
+
return nr;
}
--
1.6.0.4
From: Sukadev Bhattiprolu <[email protected]>
Subject: [v11][PATCH 9/9] Document clone_with_pids() syscall
This gives a brief overview of the clone_with_pids() system call. We should
eventually describe more details in existing clone(2) man page or in
a new man page.
Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-indpendent
code.
- [Oren Laadan] Make args_size a parameter to system call and remove
it from 'struct clone_args'
- [Oren Laadan] Fix some typos and clarify the order of pids in the
@pids parameter.
Changelog[v10]:
- Rename clone3() to clone_with_pids() and fix some typos.
- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
- [Pavel Machek]: Fix an inconsistency and rename new file to
Documentation/clone3.
- [Roland McGrath, H. Peter Anvin] Updates to description and
example to reflect new prototype of clone3() and the updated/
renamed 'struct clone_args'.
Changelog[v8]:
- clone2() is already in use in IA64. Rename syscall to clone3()
- Add notes to say that we return -EINVAL if invalid clone flags
are specified or if the reserved fields are not 0.
Changelog[v7]:
- Rename clone_with_pids() to clone2()
- Changes to reflect new prototype of clone2() (using clone_struct).
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Oren Laadan <[email protected]>
---
Documentation/clone_with_pids | 332 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 332 insertions(+), 0 deletions(-)
create mode 100644 Documentation/clone_with_pids
diff --git a/Documentation/clone_with_pids b/Documentation/clone_with_pids
new file mode 100644
index 0000000..80e9b20
--- /dev/null
+++ b/Documentation/clone_with_pids
@@ -0,0 +1,332 @@
+
+struct clone_args {
+ u64 clone_flags_high;
+ u64 child_stack_base;
+ u64 child_stack_size;
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+ u32 nr_pids;
+ u32 reserved0;
+ u64 reserved1;
+};
+
+
+clone_with_pids(u32 flags_low, struct clone_args * __user cargs,
+ int cargs_size, pid_t * __user pids)
+
+ In addition to doing everything that clone() system call does,
+ the clone_with_pids() system call:
+
+ - allows additional clone flags (31 of 32 bits in the flags
+ parameter to clone() are in use)
+
+ - allows user to specify a pid for the child process in its
+ active and ancestor pid namespaces.
+
+ This system call is meant to be used when restarting an application
+ from a checkpoint. Such restart requires that the processes in the
+ application have the same pids they had when the application was
+ checkpointed. When containers are nested, the processes within the
+ containers exist in multiple pid namespaces and hence have multiple
+ pids to specify during restart.
+
+ The @flags_low parameter is identical to the 'clone_flags' parameter
+ in existing clone() system call.
+
+ The fields in 'struct clone_args' are meant to be used as follows:
+
+ u64 clone_flags_high:
+
+ When clone_with_pids() supports more than 32 clone flags, the
+ additional bits in the clone_flags should be specified in this
+ field. This field is currently unused and must be set to 0.
+
+ u64 child_stack_base;
+ u64 child_stack_size;
+
+ These two fields correspond to the 'child_stack' fields
+ in clone() and clone2() system calls (on IA64).
+
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+
+ These two fields correspond to the 'parent_tid_ptr' and
+ 'child_tid_ptr' fields in the clone() system call
+
+ u32 nr_pids;
+
+ nr_pids specifies the number of pids in the @pids array
+ parameter to clone_with_pids() (see below). nr_pids should
+ not exceed the current nesting level of the calling process
+ (i.e if the process is in init_pid_ns, nr_pids must be 1,
+ if process is in a pid namespace that is a child of
+ init-pid-ns, nr_pids cannot exceed 2, and so on).
+
+ u32 reserved0;
+ u64 reserved1;
+
+ These fields are intended to extend the functionality of the
+ clone_with_pids() in the future, while preserving backward
+ compatibility. They must be set to 0 for now.
+
+ The @cargs_size parameter specifes the sizeof(struct clone_args) and
+ is intended to enable extending this structure in the future, while
+ preserving backward compatibility. For now, this field must be set
+ to the sizeof(struct clone_args) and this size must match the kernel's
+ view of the structure.
+
+ The @pids parameter defines the set of pids that should be assigned to
+ the child process in its active and ancestor pid namespaces. The
+ descendant pid namespaces do not matter since a process does not have a
+ pid in descendant namespaces, unless the process is in a new pid
+ namespace in which case the process is a container-init (and must have
+ the pid 1 in that namespace).
+
+ See CLONE_NEWPID section of clone(2) man page for details about pid
+ namespaces.
+
+ If a pid in the @pids list is 0, the kernel will assign the next
+ available pid in the pid namespace.
+
+ If a pid in the @pids list is non-zero, the kernel tries to assign
+ the specified pid in that namespace. If that pid is already in use
+ by another process, the system call fails (see EBUSY below).
+
+ The order of pids in @pids is oldest in pids[0] to youngest pid
+ namespace in pids[nr_pids-1]. If the number of pids specified in the
+ @pids list is fewer than the nesting level of the process, the pids
+ are applied from youngest namespace. i.e if the process is nested in
+ a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+ are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+ have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+ On success, the system call returns the pid of the child process in
+ the parent's active pid namespace.
+
+ On failure, clone_with_pids() returns -1 and sets 'errno' to one of
+ following values (the child process is not created).
+
+ EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
+ specify the pids in this call (if pids are not specifed
+ CAP_SYS_ADMIN is not required).
+
+ EINVAL The number of pids specified in 'clone_args.nr_pids' exceeds
+ the current nesting level of parent process
+
+ EINVAL Not all specified clone-flags are valid.
+
+ EINVAL The reserved fields in the clone_args argument are not 0.
+
+ EBUSY A requested pid is in use by another process in that namespace.
+
+---
+/*
+ * Example clone_with_pids() usage - Create a child with pid CHILD_TID1 if
+ * program is run in init_pid_ns. If program is run in a child of init_pid_ns,
+ * create the child process with pid CHILD_TID2.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <wait.h>
+#include <sys/syscall.h>
+
+#define __NR_clone_with_pids 337
+#define CLONE_NEWPID 0x20000000
+#define CLONE_CHILD_SETTID 0x01000000
+#define CLONE_PARENT_SETTID 0x00100000
+#define CLONE_UNUSED 0x00001000
+
+#define STACKSIZE 8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+ u64 clone_flags_high;
+
+ u64 child_stack_base;
+ u64 child_stack_size;
+
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+
+ u32 nr_pids;
+
+ u32 reserved0;
+ u64 reserved1;
+};
+
+#define exit _exit
+
+/*
+ * Following clone_with_pids() is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_clone_with_pids)
+
+int clone_with_pids(int flags_low, struct clone_args *clone_args, int args_size,
+ int *pids)
+{
+ long retval;
+
+ __asm__ __volatile__(
+ "movl %0, %%ebx\n\t" /* flags -> 1st (ebx) */
+ "movl %1, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/
+ "movl %2, %%edx\n\t" /* args_size -> 3rd (edx) */
+ "movl %3, %%edi\n\t" /* pids -> 4th (edi)*/
+ "pushl %%ebp\n\t" /* save value of ebp */
+ :
+ :"b" (flags_low),
+ "c" (clone_args),
+ "d" (args_size),
+ "D" (pids)
+ );
+
+ __asm__ __volatile__(
+ "int $0x80\n\t" /* Linux/i386 system call */
+ "testl %0,%0\n\t" /* check return value */
+ "jne 1f\n\t" /* jump if parent */
+ "popl %%ebx\n\t" /* get subthread function */
+ "call *%%ebx\n\t" /* start subthread function */
+ "movl %2,%0\n\t"
+ "int $0x80\n" /* exit system call: exit subthread */
+ "1:\n\t"
+ "popl %%ebp\t" /* restore parent's ebp */
+ :"=a" (retval)
+ :"0" (__NR_clone_with_pids), "i" (__NR_exit)
+ :"ebx", "ecx", "edx"
+ );
+
+ if (retval < 0) {
+ errno = -retval;
+ retval = -1;
+ }
+ return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg)
+{
+ void *child_stack;
+ void **new_stack;
+
+ child_stack = malloc(STACKSIZE);
+ if (!child_stack) {
+ perror("malloc()");
+ exit(1);
+ }
+ child_stack = (char *)child_stack + (STACKSIZE - 4);
+
+ new_stack = (void **)child_stack;
+ *--new_stack = child_arg;
+ *--new_stack = child_fn;
+
+ return new_stack;
+}
+
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+ int rc;
+
+ rc = syscall(__NR_gettid, 0, 0, 0);
+ if (rc < 0) {
+ printf("rc %d, errno %d\n", rc, errno);
+ exit(1);
+ }
+ return rc;
+}
+
+#define CHILD_TID1 377
+#define CHILD_TID2 25
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+ struct clone_args *cs = (struct clone_args *)arg;
+ int ctid;
+
+ /* Verify we pushed the arguments correctly on the stack... */
+ if (arg != child_arg) {
+ printf("Child: Incorrect child arg pointer, expected %p,"
+ "actual %p\n", child_arg, arg);
+ exit(1);
+ }
+
+ /* ... and that we got the thread-id we expected */
+ ctid = *((int *)cs->child_tid_ptr);
+ if (ctid != CHILD_TID) {
+ printf("Child: Incorrect child tid, expected %d, actual %d\n",
+ CHILD_TID, ctid);
+ exit(1);
+ }
+ sleep(3);
+
+ printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+ exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg,
+ unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+ int rc;
+ void *stack;
+ struct clone_args *ca = &clone_args;
+ int args_size;
+
+ stack = setup_stack(child_fn, child_arg);
+
+ memset(ca, 0, sizeof(*ca));
+
+ ca->child_stack_base = (u64)stack;
+ ca->child_tid_ptr = (u64)&child_tid;
+ ca->nr_pids = nr_pids;
+
+ args_size = sizeof(struct clone_args);
+ rc = clone_with_pids(flags_low, ca, args_size, pids_list);
+
+ printf("[%d, %d]: clone_with_pids() returned %d, error %d\n",
+ getpid(), gettid(), rc, errno);
+
+ return rc;
+}
+
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2 };
+main()
+{
+ int rc, pid, ret, status;
+ unsigned long flags;
+ int nr_pids = 1;
+
+ flags = SIGCHLD|CLONE_PARENT_SETTID|CLONE_CHILD_SETTID;
+
+ pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+ printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+ rc = waitpid(pid, &status, __WALL);
+ if (rc < 0) {
+ printf("waitpid(): rc %d, error %d\n", rc, errno);
+ } else {
+ printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+ gettid(), rc, status);
+
+ if (WIFEXITED(status)) {
+ printf("\t EXITED, %d\n", WEXITSTATUS(status));
+ } else if (WIFSIGNALED(status)) {
+ printf("\t SIGNALED, %d\n", WTERMSIG(status));
+ }
+ }
+}
--
1.6.0.4
Quoting Sukadev Bhattiprolu ([email protected]):
> + stack_size = (unsigned long)kca.child_stack_size;
> + child_stack = (unsigned long)kca.child_stack_base + stack_size;
> +
> + if (!child_stack)
> + child_stack = regs->sp;
I'm hooking up the s390 version right now. Do you think you should
make this
if (!kca.child_stack_base)
child_stack = regs->sp;
?
I suppose that in general if I pass in a NULL kca.child_stack_base
I'll also pass in a 0 stacksize, but as a user I'd expect that if
I pass in NULL, the size gets ignored. Instead, if I pass in NULL
plus a size, then the kernel will take (void *)size as the stacktop.
-serge
Quoting Sukadev Bhattiprolu ([email protected]):
...
> + If a pid in the @pids list is non-zero, the kernel tries to assign
> + the specified pid in that namespace. If that pid is already in use
> + by another process, the system call fails (see EBUSY below).
> +
> + The order of pids in @pids is oldest in pids[0] to youngest pid
> + namespace in pids[nr_pids-1]. If the number of pids specified in the
In the sys_choosepid() discussion, Matt suggested it would be more
user-friendly to have the pid for the youngest pidns be pids[0].
That way the user doesn't have to know their pidns depth.
-serge
On Fri, Nov 06, 2009 at 12:39:36PM -0600, Serge E. Hallyn wrote:
> Quoting Sukadev Bhattiprolu ([email protected]):
> ...
> > + If a pid in the @pids list is non-zero, the kernel tries to assign
> > + the specified pid in that namespace. If that pid is already in use
> > + by another process, the system call fails (see EBUSY below).
> > +
> > + The order of pids in @pids is oldest in pids[0] to youngest pid
> > + namespace in pids[nr_pids-1]. If the number of pids specified in the
>
> In the sys_choosepid() discussion, Matt suggested it would be more
> user-friendly to have the pid for the youngest pidns be pids[0].
> That way the user doesn't have to know their pidns depth.
As far as I could see, Suka's solution also does not require knowing
the pidns depth (aka level). He made it so that copy_from_user()
adjusts its destination using the discrepancy between the number of
pids passed and the number of levels.
If userspace passes an array with n pids and there are k namespace levels
then clone_with_pids() makes sure that the kernel sees a pid array like:
index 0 ... k - (n + 1) ... k - 1
+-----------------------+-------------------------+
pid_t | 0 ..................0 | <copied from userspace> |
+-----------------------+-------------------------+
So even though the order is different from choosepid() the calling
task still doesn't need to know its pidns level. Of course, just
like choosepid(), n <= k or userspace will get EINVAL.
Cheers,
-Matt Helsley
On Fri, Nov 06, 2009 at 12:18:14PM -0800, Matt Helsley wrote:
> On Fri, Nov 06, 2009 at 12:39:36PM -0600, Serge E. Hallyn wrote:
> > Quoting Sukadev Bhattiprolu ([email protected]):
> > ...
> > > + If a pid in the @pids list is non-zero, the kernel tries to assign
> > > + the specified pid in that namespace. If that pid is already in use
> > > + by another process, the system call fails (see EBUSY below).
> > > +
> > > + The order of pids in @pids is oldest in pids[0] to youngest pid
> > > + namespace in pids[nr_pids-1]. If the number of pids specified in the
> >
> > In the sys_choosepid() discussion, Matt suggested it would be more
> > user-friendly to have the pid for the youngest pidns be pids[0].
> > That way the user doesn't have to know their pidns depth.
>
> As far as I could see, Suka's solution also does not require knowing
> the pidns depth (aka level). He made it so that copy_from_user()
> adjusts its destination using the discrepancy between the number of
> pids passed and the number of levels.
>
> If userspace passes an array with n pids and there are k namespace levels
> then clone_with_pids() makes sure that the kernel sees a pid array like:
>
> index 0 ... k - (n + 1) ... k - 1
> +-----------------------+-------------------------+
> pid_t | 0 ..................0 | <copied from userspace> |
> +-----------------------+-------------------------+
(diagram assumes n != k. If n == k then pids[0] is the pid desired
in the initial namespace..)
>
> So even though the order is different from choosepid() the calling
> task still doesn't need to know its pidns level. Of course, just
> like choosepid(), n <= k or userspace will get EINVAL.
Forgot to mention that I prefer the way choosepid orders the pids.
It's not inspired by the way that the kernel implements pid namespaces
and has more to do with the way userspace sees things (IMHO). I don't
know if it makes more sense to change clone_with_pids() or have
[e]glibc wrappers swap the array contents.
Cheers,
-Matt Helsley
Matt Helsley [[email protected]] wrote:
| > If userspace passes an array with n pids and there are k namespace levels
| > then clone_with_pids() makes sure that the kernel sees a pid array like:
| >
| > index 0 ... k - (n + 1) ... k - 1
| > +-----------------------+-------------------------+
| > pid_t | 0 ..................0 | <copied from userspace> |
| > +-----------------------+-------------------------+
|
| (diagram assumes n != k. If n == k then pids[0] is the pid desired
| in the initial namespace..)
True.
Also I was not sure if we should prevent choosing pids in ancestor containers.
since a process is not even supposed to know of ancestor namespaces. Is there
a need for choosing pids in those namespaces.
|
| >
| > So even though the order is different from choosepid() the calling
| > task still doesn't need to know its pidns level. Of course, just
| > like choosepid(), n <= k or userspace will get EINVAL.
|
| Forgot to mention that I prefer the way choosepid orders the pids.
| It's not inspired by the way that the kernel implements pid namespaces
| and has more to do with the way userspace sees things (IMHO).
Hmm, In general we C/R a descendant container. So the way userspace
sees it at that point is "what are the pids of this process in my current
and in any descendant namespaces". IOW, the pid of container from which
we checkpoint seems more interesting first - right ? If so, the pids[]
are better ordered from older namespace to younger namespace ?
| I don't know if it makes more sense to change clone_with_pids() or have
| [e]glibc wrappers swap the array contents.
|
| Cheers,
| -Matt Helsley
| _______________________________________________
| Containers mailing list
| [email protected]
| https://lists.linux-foundation.org/mailman/listinfo/containers
Serge E. Hallyn [[email protected]] wrote:
| Quoting Sukadev Bhattiprolu ([email protected]):
| > + stack_size = (unsigned long)kca.child_stack_size;
| > + child_stack = (unsigned long)kca.child_stack_base + stack_size;
| > +
| > + if (!child_stack)
| > + child_stack = regs->sp;
|
| I'm hooking up the s390 version right now. Do you think you should
| make this
|
| if (!kca.child_stack_base)
| child_stack = regs->sp;
|
| ?
|
| I suppose that in general if I pass in a NULL kca.child_stack_base
| I'll also pass in a 0 stacksize, but as a user I'd expect that if
| I pass in NULL, the size gets ignored. Instead, if I pass in NULL
| plus a size, then the kernel will take (void *)size as the stacktop.
Good point. Like copy_thread() on IA64, how about ignoring 'stack_size'
if base is NULL ?
child_stack = 0UL;
if (kca.child_stack_base)
child_stack = (unsigned long)kca.child_stack_base + stack_size;
if (!child_stack)
child_stack = regs->sp;
The other question is whether we should force all architectures to pass in
the stack *base* ? clone(2) man page says:
Stacks grow downwards on all processors that run Linux (except
the HP PA processors), so child_stack usually points to the topmost
address of the memory space set up for the child stack.
To be compatibile with clone() on most architectures, should we rename
'clone_args.child_stack_base' to 'clone_args.child_stack' and let
architectures use this field like they currently use the 'child_stack'
parameter to clone(2) ?
So x86 would pass in address of top-of-stack while HP-PA can pass in address
of base-of-stack.
Arnd, Roland, Peter please let me know if you have any inputs on this.
Sukadev
Sukadev Bhattiprolu wrote:
> Matt Helsley [[email protected]] wrote:
> | > If userspace passes an array with n pids and there are k namespace levels
> | > then clone_with_pids() makes sure that the kernel sees a pid array like:
> | >
> | > index 0 ... k - (n + 1) ... k - 1
> | > +-----------------------+-------------------------+
> | > pid_t | 0 ..................0 | <copied from userspace> |
> | > +-----------------------+-------------------------+
> |
> | (diagram assumes n != k. If n == k then pids[0] is the pid desired
> | in the initial namespace..)
>
> True.
>
> Also I was not sure if we should prevent choosing pids in ancestor containers.
> since a process is not even supposed to know of ancestor namespaces. Is there
> a need for choosing pids in those namespaces.
IMHO this is a bit confusing.
A process observes a single namespace - the one in which it "lives".
There is no such thing as descendant namespaces for that process.
There may be ancestor namespaces.
The clone occurs in the context of the process. So the process that
is forking _must_ indicate pids in _ancestor_ namespaces if it wishes
to select pids in those (as is the case in c/r).
>
> |
> | >
> | > So even though the order is different from choosepid() the calling
> | > task still doesn't need to know its pidns level. Of course, just
> | > like choosepid(), n <= k or userspace will get EINVAL.
> |
> | Forgot to mention that I prefer the way choosepid orders the pids.
> | It's not inspired by the way that the kernel implements pid namespaces
> | and has more to do with the way userspace sees things (IMHO).
>
> Hmm, In general we C/R a descendant container. So the way userspace
> sees it at that point is "what are the pids of this process in my current
> and in any descendant namespaces". IOW, the pid of container from which
> we checkpoint seems more interesting first - right ? If so, the pids[]
> are better ordered from older namespace to younger namespace ?
When we checkpoint, we use an external process to record the state of
(current or) descendant namespaces.
When we restart, we run in the context of the restarting process, so
we select a pid in the current and _ancestor_ namespaces.
So the order of pids as it (will) appear in the checkpoint image for
a given process will be from an ancestor down to descendant namespaces.
And this is how we (will) hand it over to eclone().
>
> | I don't know if it makes more sense to change clone_with_pids() or have
> | [e]glibc wrappers swap the array contents.
I prefer to decide now on an order and stick to it in the kernel and
in glibc.
Oren
Quoting Oren Laadan ([email protected]):
> Sukadev Bhattiprolu wrote:
> > Matt Helsley [[email protected]] wrote:
> > | > If userspace passes an array with n pids and there are k namespace levels
> > | > then clone_with_pids() makes sure that the kernel sees a pid array like:
> > | >
> > | > index 0 ... k - (n + 1) ... k - 1
> > | > +-----------------------+-------------------------+
> > | > pid_t | 0 ..................0 | <copied from userspace> |
> > | > +-----------------------+-------------------------+
> > |
> > | (diagram assumes n != k. If n == k then pids[0] is the pid desired
> > | in the initial namespace..)
> >
> > True.
> >
> > Also I was not sure if we should prevent choosing pids in ancestor containers.
> > since a process is not even supposed to know of ancestor namespaces. Is there
> > a need for choosing pids in those namespaces.
Yes, that is necessary.
> > | I don't know if it makes more sense to change clone_with_pids() or have
> > | [e]glibc wrappers swap the array contents.
>
> I prefer to decide now on an order and stick to it in the kernel and
> in glibc.
Agreed!
I'd forgotten that, as Matt said, we can just specify pids to the depth
that we want, so I guess the current order is fine.
-serge
Here is a stab at the s390 syscall.
>From f710be4f1296d50551210bcc9ff6ba25d288bc46 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <[email protected]>
Date: Fri, 6 Nov 2009 19:03:43 -0500
Subject: [PATCH 1/1] implement s390 clone_with_pids syscall
This does the s390 hook for v11 of clone-with-pids.
I've got a program using the syscall successfully passing
args to the child function and continuing to run - but I
haven't yet gotten that hooked into user-cr/restart.c
successfully.
The core of my user-space code to use it is:
int do_child(void *arg)
{
int iarg = (int ) arg;
printf("here i am, i was passed %d, my tid is %d\n", iarg, gettid());
return 0;
}
#define do_cwp(flags, pids, args, sz) \
( { \
register unsigned long int __r2 asm ("2") = (unsigned long int)(flags); \
register unsigned long int __r3 asm ("3") = (unsigned long int)(args); \
register unsigned long int __r4 asm ("4") = (unsigned long int)(sz); \
register unsigned long int __r5 asm ("5") = (unsigned long int)(pids); \
register unsigned long int __result asm ("2"); \
__asm__ __volatile__( \
" lghi %%r1,%5\n" /* put __NR_cwp in r1 for svc 0 */ \
" svc 0\n" /* do __NR_cwp syscall */ \
" ltgr %%r2,%%r2\n" /* returned 0? */ \
" jnz 1f\n" /* if not goto label 1 */ \
" lg %%r3,0(%%r15)\n" /* get fnarg off stack into arg 1 */ \
" lg %%r2,8(%%r15)\n" /* get fn off stack into r3 for basr*/ \
" lgr %%r1,%%r15\n" /* tmp store old stack pointer */ \
" aghi %%r15,-160\n" /* move the stack */ \
" stg %%r1,0(%%r15)\n" /* and save old stack pointer */ \
" basr %%r14,%%r3\n" /* call fn(arg) */ \
" svc 1\n" /* call exit */ \
" 1:\n" \
: "=d" (__result) \
: "0" (__r2), "d" (__r3), "d" (__r4), "d" (__r5), \
"i" (__NR_clone_with_pids) \
: "1", "cc", "memory"); \
__result; \
} )
int clone_with_pids(int (*fn)(void *), int flags, int nrpids, int *pids,
void *fnarg)
{
long retval;
struct clone_args clone_args, *ca = &clone_args;
int stacksize;
void *sb;
u64 *s;
int i;
memset(ca, 0, sizeof(struct clone_args));
stacksize = 4*getpagesize();
sb = (void *) malloc(stacksize);
if (!sb) {
perror("malloc");
_exit(1);
}
ca->child_stack_base = (u64) sb;
ca->child_stack_size = stacksize-8;
s = (u64 *)(sb + ca->child_stack_size);
*--s = (u64)fnarg;
*--s = (u64)fn;
ca->child_stack_size -= 16;
ca->nr_pids = nrpids;
retval = do_cwp(flags, pids, ca, sizeof(struct clone_args));
return retval;
}
Signed-off-by: Serge E. Hallyn <[email protected]>
---
arch/s390/include/asm/unistd.h | 3 +-
arch/s390/kernel/compat_linux.c | 50 ++++++++++++++++++++++++++++++++++++++
arch/s390/kernel/process.c | 51 +++++++++++++++++++++++++++++++++++++++
arch/s390/kernel/syscalls.S | 1 +
4 files changed, 104 insertions(+), 1 deletions(-)
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index cb5232d..ae9474e 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,8 @@
#define __NR_pwritev 329
#define __NR_rt_tgsigqueueinfo 330
#define __NR_perf_event_open 331
-#define NR_syscalls 332
+#define __NR_clone_with_pids 332
+#define NR_syscalls 333
/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 0debcec..1750fae 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -762,6 +762,56 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
return sys_write(fd, buf, count);
}
+asmlinkage long sys32_clone_with_pids(void)
+{
+ int rc;
+ struct pt_regs *regs = task_pt_regs(current);
+ int args_size;
+ struct clone_args kca;
+ unsigned long flags;
+ int __user *parent_tid_ptr;
+ int __user *child_tid_ptr;
+ unsigned long __user child_stack;
+ unsigned long stack_size;
+ unsigned int flags_low;
+ struct clone_args __user *uca;
+ pid_t __user *pids;
+
+ flags_low = regs->gprs[2] & 0xffffffffUL;
+ uca = compat_ptr(regs->gprs[3]);
+ args_size = regs->gprs[4] & 0xffffffffUL;
+ pids = compat_ptr(regs->gprs[5]);
+
+ rc = fetch_clone_args_from_user(uca, args_size, &kca);
+ if (rc)
+ return rc;
+
+ /*
+ * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+ * TODO: When ->clone_flags_high is non-zero, copy it in to the
+ * higher word(s) of 'flags':
+ *
+ * flags = (kca.clone_flags_high << 32) | flags_low;
+ */
+ flags = flags_low;
+ parent_tid_ptr = (int *)kca.parent_tid_ptr;
+ child_tid_ptr = (int *)kca.child_tid_ptr;
+
+ stack_size = (unsigned long)kca.child_stack_size;
+ child_stack = (unsigned long)kca.child_stack_base + stack_size;
+
+ if (!child_stack)
+ child_stack = regs->gprs[15];
+
+ /*
+ * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+ * to several functions. Need to convert clone_flags to 64-bit.
+ */
+ return do_fork_with_pids(flags, child_stack, regs, stack_size,
+ parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+ pids);
+}
+
/*
* 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
* These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5417eb5..e27a1b4 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -241,6 +241,57 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
parent_tidptr, child_tidptr);
}
+SYSCALL_DEFINE0(clone_with_pids)
+{
+ int rc;
+ struct pt_regs *regs = task_pt_regs(current);
+ int args_size;
+ struct clone_args kca;
+ unsigned long flags;
+ int __user *parent_tid_ptr;
+ int __user *child_tid_ptr;
+ unsigned long __user child_stack;
+ unsigned long stack_size;
+ unsigned int flags_low;
+ struct clone_args __user *uca;
+ pid_t __user *pids;
+
+ flags_low = regs->gprs[2];
+ uca = (struct clone_args __user *)regs->gprs[3];
+ args_size = regs->gprs[4];
+ pids = (pid_t __user *)regs->gprs[5];
+
+ rc = fetch_clone_args_from_user(uca, args_size, &kca);
+ if (rc)
+ return rc;
+
+ /*
+ * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+ * TODO: When ->clone_flags_high is non-zero, copy it in to the
+ * higher word(s) of 'flags':
+ *
+ * flags = (kca.clone_flags_high << 32) | flags_low;
+ */
+ flags = flags_low;
+ parent_tid_ptr = (int *)kca.parent_tid_ptr;
+ child_tid_ptr = (int *)kca.child_tid_ptr;
+
+ stack_size = (unsigned long)kca.child_stack_size;
+ child_stack = (unsigned long)kca.child_stack_base;
+ if (child_stack)
+ child_stack += stack_size;
+ else
+ child_stack = regs->gprs[15];
+
+ /*
+ * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+ * to several functions. Need to convert clone_flags to 64-bit.
+ */
+ return do_fork_with_pids(flags, child_stack, regs, stack_size,
+ parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+ pids);
+}
+
/*
* This is trivial, and on the face of it looks like it
* could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 30eca07..c6dc240 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,4 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper)
+SYSCALL(sys_clone_with_pids,sys_clone_with_pids,sys_clone_with_pids_wrapper)
--
1.6.1