2009-03-10 21:46:24

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
> On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat

> > 47 files changed, 20702 insertions(+)
> >
> > One important thing that leaves out is the interaction that this code
> > has with the rest of the kernel. That's critically important when
> > considering long-term maintenance, and I'd be curious how the OpenVZ
> > folks view it.
>
> OpenVZ as-is in some cases wants some functions to be made global
> (and if C/R code will be modular, exported). Or probably several
> iterators added.
>
> But it's negligible amount of changes compared to main code.

Here is what C/R code wants from pid allocator.

With the introduction of hierarchical PID namespaces, struct pid can
have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
where pid_i is pid number in pid_ns which has level i.

Now root pid_ns of container has level n -- numbers from level n to N
inclusively should be dumped and restored.

During struct pid creation first n-1 numbers can be anything, because the're
outside of pid_ns, but the rest should be the same.

Code will be ifdeffed and commented, but anyhow, this is an example of
change C/R will require from the rest of the kernel.



--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
return -1;
}

+static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
+{
+ int offset;
+ struct pidmap *map;
+
+ offset = pid & BITS_PER_PAGE_MASK;
+ map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+ if (unlikely(!map->page)) {
+ void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+ /*
+ * Free the page if someone raced with us
+ * installing it:
+ */
+ spin_lock_irq(&pidmap_lock);
+ if (map->page)
+ kfree(page);
+ else
+ map->page = page;
+ spin_unlock_irq(&pidmap_lock);
+ if (unlikely(!map->page))
+ return -ENOMEM;
+ }
+ if (test_and_set_bit(offset, map->page))
+ return -EBUSY;
+ atomic_dec(&map->nr_free);
+ return pid;
+}
+
int next_pidmap(struct pid_namespace *pid_ns, int last)
{
int offset;
@@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
{
struct pid *pid;
enum pid_type type;
@@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ if (cr_nr && ns->level - i <= cr_level)
+ nr = set_pidmap(tmp, cr_nr[ns->level - i]);
+ else
+ nr = alloc_pidmap(tmp);
if (nr < 0)
goto out_free;


2009-03-10 23:28:42

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Quoting Alexey Dobriyan ([email protected]):
> On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
> > On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
> > > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat
>
> > > 47 files changed, 20702 insertions(+)
> > >
> > > One important thing that leaves out is the interaction that this code
> > > has with the rest of the kernel. That's critically important when
> > > considering long-term maintenance, and I'd be curious how the OpenVZ
> > > folks view it.
> >
> > OpenVZ as-is in some cases wants some functions to be made global
> > (and if C/R code will be modular, exported). Or probably several
> > iterators added.
> >
> > But it's negligible amount of changes compared to main code.
>
> Here is what C/R code wants from pid allocator.

Yup. Agreed. That is exactly what I would have thought it would look
like. We may have found the first bit of helper code we can all agree
on for c/r? :)

Eric may disagree as he wanted to play games with
/proc/sys/kernel/pid_max, but that seems hard to pull off for nested
pid namespaces.


thanks,
-serge

> With the introduction of hierarchical PID namespaces, struct pid can
> have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
> where pid_i is pid number in pid_ns which has level i.
>
> Now root pid_ns of container has level n -- numbers from level n to N
> inclusively should be dumped and restored.
>
> During struct pid creation first n-1 numbers can be anything, because the're
> outside of pid_ns, but the rest should be the same.
>
> Code will be ifdeffed and commented, but anyhow, this is an example of
> change C/R will require from the rest of the kernel.
>
>
>
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> return -1;
> }
>
> +static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
> +{
> + int offset;
> + struct pidmap *map;
> +
> + offset = pid & BITS_PER_PAGE_MASK;
> + map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> + if (unlikely(!map->page)) {
> + void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> + /*
> + * Free the page if someone raced with us
> + * installing it:
> + */
> + spin_lock_irq(&pidmap_lock);
> + if (map->page)
> + kfree(page);
> + else
> + map->page = page;
> + spin_unlock_irq(&pidmap_lock);
> + if (unlikely(!map->page))
> + return -ENOMEM;
> + }
> + if (test_and_set_bit(offset, map->page))
> + return -EBUSY;
> + atomic_dec(&map->nr_free);
> + return pid;
> +}
> +
> int next_pidmap(struct pid_namespace *pid_ns, int last)
> {
> int offset;
> @@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
> call_rcu(&pid->rcu, delayed_put_pid);
> }
>
> -struct pid *alloc_pid(struct pid_namespace *ns)
> +struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
> {
> struct pid *pid;
> enum pid_type type;
> @@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
> tmp = ns;
> for (i = ns->level; i >= 0; i--) {
> - nr = alloc_pidmap(tmp);
> + if (cr_nr && ns->level - i <= cr_level)
> + nr = set_pidmap(tmp, cr_nr[ns->level - i]);
> + else
> + nr = alloc_pidmap(tmp);
> if (nr < 0)
> goto out_free;
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

2009-03-11 08:27:16

by Cedric Le Goater

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Alexey Dobriyan wrote:
> On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
>> On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
>>> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat
>
>>> 47 files changed, 20702 insertions(+)
>>>
>>> One important thing that leaves out is the interaction that this code
>>> has with the rest of the kernel. That's critically important when
>>> considering long-term maintenance, and I'd be curious how the OpenVZ
>>> folks view it.
>> OpenVZ as-is in some cases wants some functions to be made global
>> (and if C/R code will be modular, exported). Or probably several
>> iterators added.
>>
>> But it's negligible amount of changes compared to main code.
>
> Here is what C/R code wants from pid allocator.
>
> With the introduction of hierarchical PID namespaces, struct pid can
> have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
> where pid_i is pid number in pid_ns which has level i.
>
> Now root pid_ns of container has level n -- numbers from level n to N
> inclusively should be dumped and restored.
>
> During struct pid creation first n-1 numbers can be anything, because the're
> outside of pid_ns, but the rest should be the same.
>
> Code will be ifdeffed and commented, but anyhow, this is an example of
> change C/R will require from the rest of the kernel.
>
>
>
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> return -1;
> }
>
> +static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
> +{
> + int offset;
> + struct pidmap *map;
> +
> + offset = pid & BITS_PER_PAGE_MASK;
> + map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> + if (unlikely(!map->page)) {
> + void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> + /*
> + * Free the page if someone raced with us
> + * installing it:
> + */
> + spin_lock_irq(&pidmap_lock);
> + if (map->page)
> + kfree(page);
> + else
> + map->page = page;
> + spin_unlock_irq(&pidmap_lock);
> + if (unlikely(!map->page))
> + return -ENOMEM;
> + }
> + if (test_and_set_bit(offset, map->page))
> + return -EBUSY;
> + atomic_dec(&map->nr_free);
> + return pid;
> +}
> +
> int next_pidmap(struct pid_namespace *pid_ns, int last)
> {
> int offset;
> @@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
> call_rcu(&pid->rcu, delayed_put_pid);
> }
>
> -struct pid *alloc_pid(struct pid_namespace *ns)
> +struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
> {
> struct pid *pid;
> enum pid_type type;
> @@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
> tmp = ns;
> for (i = ns->level; i >= 0; i--) {
> - nr = alloc_pidmap(tmp);
> + if (cr_nr && ns->level - i <= cr_level)
> + nr = set_pidmap(tmp, cr_nr[ns->level - i]);
> + else
> + nr = alloc_pidmap(tmp);
> if (nr < 0)
> goto out_free;

This patch supposes that the process is restored in a state which took several
clone(CLONE_NEWPID) to reach. if you replay these clone(), which is what restart
is at the end : an optimized replay, you would only need something like below.



Index: 2.6.git/kernel/pid.c
===================================================================
--- 2.6.git.orig/kernel/pid.c
+++ 2.6.git/kernel/pid.c
@@ -122,12 +122,12 @@ static void free_pidmap(struct upid *upi
atomic_inc(&map->nr_free);
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;

- pid = last + 1;
+ pid = upid ? upid : last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
offset = pid & BITS_PER_PAGE_MASK;
@@ -239,7 +239,7 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t next_pid)
{
struct pid *pid;
enum pid_type type;
@@ -253,10 +253,15 @@ struct pid *alloc_pid(struct pid_namespa

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ nr = alloc_pidmap(tmp, next_pid);
if (nr < 0)
goto out_free;

+ /* The next_pid is only applicable for the ns namespace, not
+ * its parents.
+ */
+ next_pid = 0;
+
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;





Well, that's how we do it but I'm not against your patch. It fits our need also.
It's just a bit intrusive for the pid bitmap. if we mix both path, we get something
like this fake patch, which is a bit less intrusive IMO. not tested though.



@@ -122,12 +122,12 @@ static void free_pidmap(struct upid *upi
atomic_inc(&map->nr_free);
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;

- pid = last + 1;
+ pid = upid ? upid : last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
offset = pid & BITS_PER_PAGE_MASK;


@@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
{
struct pid *pid;
enum pid_type type;
@@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ if (cr_nr && ns->level - i <= cr_level)
+ nr = alloc_pidmap(tmp, cr_nr[ns->level - i]);
+ if (nr != cr_nr[ns->level - i])
+ return -EBUSY;
+ else
+ nr = alloc_pidmap(tmp);
if (nr < 0)
goto out_free;

2009-03-12 14:59:46

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Quoting Cedric Le Goater ([email protected]):
> Alexey Dobriyan wrote:
> > On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
> >> On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
> >>> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat
> >
> >>> 47 files changed, 20702 insertions(+)
> >>>
> >>> One important thing that leaves out is the interaction that this code
> >>> has with the rest of the kernel. That's critically important when
> >>> considering long-term maintenance, and I'd be curious how the OpenVZ
> >>> folks view it.
> >> OpenVZ as-is in some cases wants some functions to be made global
> >> (and if C/R code will be modular, exported). Or probably several
> >> iterators added.
> >>
> >> But it's negligible amount of changes compared to main code.
> >
> > Here is what C/R code wants from pid allocator.
> >
> > With the introduction of hierarchical PID namespaces, struct pid can
> > have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
> > where pid_i is pid number in pid_ns which has level i.
> >
> > Now root pid_ns of container has level n -- numbers from level n to N
> > inclusively should be dumped and restored.
> >
> > During struct pid creation first n-1 numbers can be anything, because the're
> > outside of pid_ns, but the rest should be the same.
> >
> > Code will be ifdeffed and commented, but anyhow, this is an example of
> > change C/R will require from the rest of the kernel.
> >
> >
> >
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> > return -1;
> > }
> >
> > +static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
> > +{
> > + int offset;
> > + struct pidmap *map;
> > +
> > + offset = pid & BITS_PER_PAGE_MASK;
> > + map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> > + if (unlikely(!map->page)) {
> > + void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > + /*
> > + * Free the page if someone raced with us
> > + * installing it:
> > + */
> > + spin_lock_irq(&pidmap_lock);
> > + if (map->page)
> > + kfree(page);
> > + else
> > + map->page = page;
> > + spin_unlock_irq(&pidmap_lock);
> > + if (unlikely(!map->page))
> > + return -ENOMEM;
> > + }
> > + if (test_and_set_bit(offset, map->page))
> > + return -EBUSY;
> > + atomic_dec(&map->nr_free);
> > + return pid;
> > +}
> > +
> > int next_pidmap(struct pid_namespace *pid_ns, int last)
> > {
> > int offset;
> > @@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
> > call_rcu(&pid->rcu, delayed_put_pid);
> > }
> >
> > -struct pid *alloc_pid(struct pid_namespace *ns)
> > +struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
> > {
> > struct pid *pid;
> > enum pid_type type;
> > @@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >
> > tmp = ns;
> > for (i = ns->level; i >= 0; i--) {
> > - nr = alloc_pidmap(tmp);
> > + if (cr_nr && ns->level - i <= cr_level)
> > + nr = set_pidmap(tmp, cr_nr[ns->level - i]);
> > + else
> > + nr = alloc_pidmap(tmp);
> > if (nr < 0)
> > goto out_free;
>
> This patch supposes that the process is restored in a state which took several
> clone(CLONE_NEWPID) to reach. if you replay these clone(), which is what restart
> is at the end : an optimized replay, you would only need something like below.

No, what you're suggesting does not suffice.

Call
(5591,3,1) the task knows as 5591 in the init_pid_ns, 3 in a child pid
ns, and 1 in grandchild pid_ns created from there. Now assume we are
checkpointing tasks T1=(5592,1), and T2=(5594,3,1).

We don't care about the first number in the tuples, so they will be
random numbers after the recreate. But we do care about the second
numbers. But specifying CLONE_NEWPID while recreating the process tree
in userspace does not allow you to specify the 3 in (5594,3,1).

Or are you suggesting that you'll do a dummy clone of (5594,2) so that
the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?

-serge

2009-03-12 21:02:58

by Greg Kurz

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
>

Of course not but one should be able to tell clone() to pick a specific
pid.

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2009-03-12 21:21:42

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Quoting Greg Kurz ([email protected]):
> On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> > Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> > the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> >
>
> Of course not

Ok - someone *did* argue that at some point I think...

> but one should be able to tell clone() to pick a specific
> pid.

Can you explain exactly how? I must be missing something clever.

-serge

2009-03-13 04:29:55

by Ying Han

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Hi Serge:
I made a patch based on Oren's tree recently which implement a new
syscall clone_with_pid. I tested with checkpoint/restart process tree
and it works as expected.
This patch has some hack in it which i made a copy of libc's clone and
made modifications of passing one more argument(pid number). I will
try to clean up the code and do more testing.

New syscall clone_with_pid
Implement a new syscall which clone a thread with a preselected pid number.

clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
CLONE_WITH_PID|SIGCHLD, pid, NULL);

Signed-off-by: Ying Han <[email protected]>

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 87803da..b5a1b03 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
asmlinkage int sys_clone(struct pt_regs);
asmlinkage int sys_vfork(struct pt_regs);
asmlinkage int sys_execve(struct pt_regs);
+asmlinkage int sys_clone_with_pid(struct pt_regs);

/* kernel/signal_32.c */
asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
index a5f9e09..f10ca0e 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
#define __NR_inotify_init1 332
#define __NR_checkpoint 333
#define __NR_restart 334
+#define __NR_clone_with_pid 335

#ifdef __KERNEL__

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 0a1302f..88ae634 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -8,7 +8,6 @@
/*
* This file handles the architecture-dependent parts of process handling..
*/
-
#include <stdarg.h>

#include <linux/cpu.h>
@@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
}

+/**
+ * sys_clone_with_pid - clone a thread with pre-select pid number.
+ */
+asmlinkage int sys_clone_with_pid(struct pt_regs regs)
+{
+ unsigned long clone_flags;
+ unsigned long newsp;
+ int __user *parent_tidptr, *child_tidptr;
+ pid_t pid_nr;
+
+ clone_flags = regs.bx;
+ newsp = regs.cx;
+ parent_tidptr = (int __user *)regs.dx;
+ child_tidptr = (int __user *)regs.di;
+ pid_nr = regs.bp;
+
+ if (!newsp)
+ newsp = regs.sp;
+ return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
+ child_tidptr);
+}
+
/*
* This is trivial, and on the face of it looks like it
* could equally well be done in user mode.
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_tabl
index 5543136..5191117 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
.long sys_inotify_init1
.long sys_checkpoint
.long sys_restart
+ .long sys_clone_with_pid
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 50bde9a..a4aee65 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -7,7 +7,6 @@
* License. See the file COPYING in the main directory of the Linux
* distribution for more details.
*/
-
#include <asm/desc.h>
#include <asm/i387.h>

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 64155de..b7de611 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -8,6 +8,7 @@
* distribution for more details.
*/

+#define DEBUG
#include <linux/version.h>
#include <linux/sched.h>
#include <linux/ptrace.h>
@@ -564,3 +565,4 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
out:
return ret;
}
+
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
index e3097ac..a8c5ad5 100644
--- a/checkpoint/ckpt_file.c
+++ b/checkpoint/ckpt_file.c
@@ -7,7 +7,7 @@
* License. See the file COPYING in the main directory of the Linux
* distribution for more details.
*/
-
+#define DEBUG
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/file.h>
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
index 4925ff2..ca5840b 100644
--- a/checkpoint/ckpt_mem.c
+++ b/checkpoint/ckpt_mem.c
@@ -7,7 +7,7 @@
* License. See the file COPYING in the main directory of the Linux
* distribution for more details.
*/
-
+#define DEBUG
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/slab.h>
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 7ec4de4..30e43c2 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -8,6 +8,7 @@
* distribution for more details.
*/

+#define DEBUG
#include <linux/version.h>
#include <linux/sched.h>
#include <linux/wait.h>
@@ -242,7 +243,7 @@ static int cr_read_task_struct(struct cr_ctx *ctx)
memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
}
kfree(buf);
-
+ pr_debug("read task %s\n", t->comm);
/* FIXME: restore remaining relevant task_struct fields */
out:
cr_hbuf_put(ctx, sizeof(*hh));
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
index f44b081..755e40e 100644
--- a/checkpoint/rstr_file.c
+++ b/checkpoint/rstr_file.c
@@ -7,7 +7,7 @@
* License. See the file COPYING in the main directory of the Linux
* distribution for more details.
*/
-
+#define DEBUG
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/fs.h>
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
index 4d5ce1a..8330468 100644
--- a/checkpoint/rstr_mem.c
+++ b/checkpoint/rstr_mem.c
@@ -7,7 +7,7 @@
* License. See the file COPYING in the main directory of the Linux
* distribution for more details.
*/
-
+#define DEBUG
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/fcntl.h>
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index f26b0c6..d1a5394 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -7,7 +7,7 @@
* License. See the file COPYING in the main directory of the Linux
* distribution for more details.
*/
-
+#define DEBUG
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/kernel.h>
@@ -263,7 +263,6 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned
return PTR_ERR(ctx);

ret = do_checkpoint(ctx, pid);
-
if (!ret)
ret = ctx->crid;

@@ -304,3 +303,4 @@ asmlinkage long sys_restart(int crid, int fd, unsigned lon
cr_ctx_put(ctx);
return ret;
}
+
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 217cf6e..bc2c202 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -114,7 +114,6 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_
extern int do_restart(struct cr_ctx *ctx, pid_t pid);
extern int cr_read_mm(struct cr_ctx *ctx);
extern int cr_read_files(struct cr_ctx *ctx);
-
#ifdef pr_fmt
#undef pr_fmt
#endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index d7e98ff..86e2f61 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
int next_pidmap(struct pid_namespace *pid_ns, int last);

-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr);
extern void free_pid(struct pid *pid);

/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0150e90..7fb4e28 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -28,6 +28,7 @@
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */
+#define CLONE_WITH_PID 0x00001000 /* Clone with pre-select PID */

/*
* Scheduling policies
diff --git a/kernel/exit.c b/kernel/exit.c
index 2d8be7e..4baf651 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -3,7 +3,7 @@
*
* Copyright (C) 1991, 1992 Linus Torvalds
*/
-
+#define DEBUG
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/interrupt.h>
@@ -1676,6 +1676,7 @@ static long do_wait(enum pid_type type, struct pid *pid,
DECLARE_WAITQUEUE(wait, current);
struct task_struct *tsk;
int retval;
+ int level;

trace_sched_process_wait(pid);

@@ -1708,7 +1709,6 @@ repeat:
retval = tsk_result;
goto end;
}
-
if (options & __WNOTHREAD)
break;
tsk = next_thread(tsk);
@@ -1817,7 +1817,6 @@ asmlinkage long sys_wait4(pid_t upid, int __user *stat_a
type = PIDTYPE_PID;
pid = find_get_pid(upid);
}
-
ret = do_wait(type, pid, options | WEXITED, NULL, stat_addr, ru);
put_pid(pid);

diff --git a/kernel/fork.c b/kernel/fork.c
index 085ce56..262ae1e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -10,7 +10,7 @@
* Fork is rather simple, once you get the hang of it, but the memory
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/
-
+#define DEBUG
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/unistd.h>
@@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+ pid_t clone_pid = stack_size;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);

+ /* We only allow the clone_with_pid when a new pid namespace is
+ * created. FIXME: how to restrict it.
+ */
+ if ((clone_flags & CLONE_NEWPID) && (clone_flags & CLONE_WITH_PID))
+ return ERR_PTR(-EINVAL);
+ if ((clone_flags & CLONE_WITH_PID) && (clone_pid <= 1))
+ return ERR_PTR(-EINVAL);
+
/*
* Thread groups must share signals as well, and detached threads
* can only be started up within the thread group.
@@ -1135,7 +1144,10 @@ static struct task_struct *copy_process(unsigned long c

if (pid != &init_struct_pid) {
retval = -ENOMEM;
- pid = alloc_pid(task_active_pid_ns(p));
+ if (clone_flags & CLONE_WITH_PID)
+ pid = alloc_pid(task_active_pid_ns(p), clone_pid);
+ else
+ pid = alloc_pid(task_active_pid_ns(p), 0);
if (!pid)
goto bad_fork_cleanup_io;

@@ -1162,6 +1174,8 @@ static struct task_struct *copy_process(unsigned long cl
* Clear TID on mm_release()?
*/
p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr: NU
+
+
#ifdef CONFIG_FUTEX
p->robust_list = NULL;
#ifdef CONFIG_COMPAT
diff --git a/kernel/pid.c b/kernel/pid.c
index 064e76a..0facf05 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -25,7 +25,7 @@
* Many thanks to Oleg Nesterov for comments and help
*
*/
-
+#define DEBUG
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/slab.h>
@@ -122,12 +122,15 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t pid_nr)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;

- pid = last + 1;
+ if (pid_nr)
+ pid = pid_nr;
+ else
+ pid = last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
offset = pid & BITS_PER_PAGE_MASK;
@@ -153,9 +156,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
do {
if (!test_and_set_bit(offset, map->page)) {
atomic_dec(&map->nr_free);
- pid_ns->last_pid = pid;
+ if (!pid_nr)
+ pid_ns->last_pid = pid;
return pid;
}
+ if (pid_nr)
+ return -1;
offset = find_next_offset(map, offset);
pid = mk_pid(pid_ns, map, offset);
/*
@@ -239,21 +245,25 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr)
{
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+ int level = ns->level;
+
+ if (pid_nr >= pid_max)
+ return NULL;

pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
goto out;

- tmp = ns;
- for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ tmp = ns->parent;
+ for (i = level-1; i >= 0; i--) {
+ nr = alloc_pidmap(tmp, 0);
if (nr < 0)
goto out_free;

@@ -262,6 +272,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
tmp = tmp->parent;
}

+ nr = alloc_pidmap(ns, pid_nr);
+ if (nr < 0)
+ goto out_free;
+ pid->numbers[level].nr = nr;
+ pid->numbers[level].ns = ns;
+ if (nr == pid_nr)
+ pr_debug("nr == pid_nr == %d\n", nr);
+
get_pid_ns(ns);
pid->level = ns->level;
atomic_set(&pid->count, 1);








On Thu, Mar 12, 2009 at 2:21 PM, Serge E. Hallyn <[email protected]> wrote:
>
> Quoting Greg Kurz ([email protected]):
> > On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> > > Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> > > the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> > >
> >
> > Of course not
>
> Ok - someone *did* argue that at some point I think...
>
> > but one should be able to tell clone() to pick a specific
> > pid.
>
> Can you explain exactly how? ?I must be missing something clever.
>
> -serge
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. ?For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2009-03-13 05:37:07

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Ying Han [[email protected]] wrote:
| Hi Serge:
| I made a patch based on Oren's tree recently which implement a new
| syscall clone_with_pid. I tested with checkpoint/restart process tree
| and it works as expected.

Yes, I think we had a version of clone() with pid a while ago.

But it would be easier to review if you break it up into smaller
patches. and remove the unnecessary diffs in this patch like...


| This patch has some hack in it which i made a copy of libc's clone and
| made modifications of passing one more argument(pid number). I will
| try to clean up the code and do more testing.
|
| New syscall clone_with_pid
| Implement a new syscall which clone a thread with a preselected pid number.
|
| clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
| CLONE_WITH_PID|SIGCHLD, pid, NULL);
|
| Signed-off-by: Ying Han <[email protected]>
|
| diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
| index 87803da..b5a1b03 100644
| --- a/arch/x86/include/asm/syscalls.h
| +++ b/arch/x86/include/asm/syscalls.h
| @@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
| asmlinkage int sys_clone(struct pt_regs);
| asmlinkage int sys_vfork(struct pt_regs);
| asmlinkage int sys_execve(struct pt_regs);
| +asmlinkage int sys_clone_with_pid(struct pt_regs);
|
| /* kernel/signal_32.c */
| asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
| diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
| index a5f9e09..f10ca0e 100644
| --- a/arch/x86/include/asm/unistd_32.h
| +++ b/arch/x86/include/asm/unistd_32.h
| @@ -340,6 +340,7 @@
| #define __NR_inotify_init1 332
| #define __NR_checkpoint 333
| #define __NR_restart 334
| +#define __NR_clone_with_pid 335
|
| #ifdef __KERNEL__
|
| diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
| index 0a1302f..88ae634 100644
| --- a/arch/x86/kernel/process_32.c
| +++ b/arch/x86/kernel/process_32.c
| @@ -8,7 +8,6 @@
| /*
| * This file handles the architecture-dependent parts of process handling..
| */
| -

these

| #include <stdarg.h>
|
| #include <linux/cpu.h>
| @@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
| return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
| }
|
| +/**
| + * sys_clone_with_pid - clone a thread with pre-select pid number.
| + */
| +asmlinkage int sys_clone_with_pid(struct pt_regs regs)
| +{
| + unsigned long clone_flags;
| + unsigned long newsp;
| + int __user *parent_tidptr, *child_tidptr;
| + pid_t pid_nr;
| +
| + clone_flags = regs.bx;
| + newsp = regs.cx;
| + parent_tidptr = (int __user *)regs.dx;
| + child_tidptr = (int __user *)regs.di;
| + pid_nr = regs.bp;
| +
| + if (!newsp)
| + newsp = regs.sp;
| + return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
| + child_tidptr);
| +}
| +
| /*
| * This is trivial, and on the face of it looks like it
| * could equally well be done in user mode.
| diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_tabl
| index 5543136..5191117 100644
| --- a/arch/x86/kernel/syscall_table_32.S
| +++ b/arch/x86/kernel/syscall_table_32.S
| @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
| .long sys_inotify_init1
| .long sys_checkpoint
| .long sys_restart
| + .long sys_clone_with_pid
| diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
| index 50bde9a..a4aee65 100644
| --- a/arch/x86/mm/checkpoint.c
| +++ b/arch/x86/mm/checkpoint.c
| @@ -7,7 +7,6 @@
| * License. See the file COPYING in the main directory of the Linux
| * distribution for more details.
| */
| -
| #include <asm/desc.h>
| #include <asm/i387.h>
|
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index 64155de..b7de611 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -8,6 +8,7 @@
| * distribution for more details.
| */
|
| +#define DEBUG
| #include <linux/version.h>
| #include <linux/sched.h>
| #include <linux/ptrace.h>
| @@ -564,3 +565,4 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
| out:
| return ret;
| }
| +
| diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
| index e3097ac..a8c5ad5 100644
| --- a/checkpoint/ckpt_file.c
| +++ b/checkpoint/ckpt_file.c
| @@ -7,7 +7,7 @@
| * License. See the file COPYING in the main directory of the Linux
| * distribution for more details.
| */
| -
| +#define DEBUG
| #include <linux/kernel.h>
| #include <linux/sched.h>
| #include <linux/file.h>
| diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
| index 4925ff2..ca5840b 100644
| --- a/checkpoint/ckpt_mem.c
| +++ b/checkpoint/ckpt_mem.c
| @@ -7,7 +7,7 @@
| * License. See the file COPYING in the main directory of the Linux
| * distribution for more details.
| */
| -
| +#define DEBUG
| #include <linux/kernel.h>
| #include <linux/sched.h>
| #include <linux/slab.h>
| diff --git a/checkpoint/restart.c b/checkpoint/restart.c
| index 7ec4de4..30e43c2 100644
| --- a/checkpoint/restart.c
| +++ b/checkpoint/restart.c
| @@ -8,6 +8,7 @@
| * distribution for more details.
| */
|
| +#define DEBUG
| #include <linux/version.h>
| #include <linux/sched.h>
| #include <linux/wait.h>
| @@ -242,7 +243,7 @@ static int cr_read_task_struct(struct cr_ctx *ctx)
| memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
| }
| kfree(buf);
| -
| + pr_debug("read task %s\n", t->comm);
| /* FIXME: restore remaining relevant task_struct fields */
| out:
| cr_hbuf_put(ctx, sizeof(*hh));
| diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
| index f44b081..755e40e 100644
| --- a/checkpoint/rstr_file.c
| +++ b/checkpoint/rstr_file.c
| @@ -7,7 +7,7 @@
| * License. See the file COPYING in the main directory of the Linux
| * distribution for more details.
| */
| -
| +#define DEBUG
| #include <linux/kernel.h>
| #include <linux/sched.h>
| #include <linux/fs.h>
| diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
| index 4d5ce1a..8330468 100644
| --- a/checkpoint/rstr_mem.c
| +++ b/checkpoint/rstr_mem.c
| @@ -7,7 +7,7 @@
| * License. See the file COPYING in the main directory of the Linux
| * distribution for more details.
| */
| -
| +#define DEBUG
| #include <linux/kernel.h>
| #include <linux/sched.h>
| #include <linux/fcntl.h>
| diff --git a/checkpoint/sys.c b/checkpoint/sys.c
| index f26b0c6..d1a5394 100644
| --- a/checkpoint/sys.c
| +++ b/checkpoint/sys.c
| @@ -7,7 +7,7 @@
| * License. See the file COPYING in the main directory of the Linux
| * distribution for more details.
| */
| -
| +#define DEBUG
| #include <linux/sched.h>
| #include <linux/nsproxy.h>
| #include <linux/kernel.h>
| @@ -263,7 +263,6 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned
| return PTR_ERR(ctx);
|
| ret = do_checkpoint(ctx, pid);
| -
| if (!ret)
| ret = ctx->crid;
|
| @@ -304,3 +303,4 @@ asmlinkage long sys_restart(int crid, int fd, unsigned lon
| cr_ctx_put(ctx);
| return ret;
| }
| +
| diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
| index 217cf6e..bc2c202 100644
| --- a/include/linux/checkpoint.h
| +++ b/include/linux/checkpoint.h
| @@ -114,7 +114,6 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_
| extern int do_restart(struct cr_ctx *ctx, pid_t pid);
| extern int cr_read_mm(struct cr_ctx *ctx);
| extern int cr_read_files(struct cr_ctx *ctx);
| -
| #ifdef pr_fmt
| #undef pr_fmt
| #endif
| diff --git a/include/linux/pid.h b/include/linux/pid.h
| index d7e98ff..86e2f61 100644
| --- a/include/linux/pid.h
| +++ b/include/linux/pid.h
| @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
| extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
| int next_pidmap(struct pid_namespace *pid_ns, int last);
|
| -extern struct pid *alloc_pid(struct pid_namespace *ns);
| +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr);
| extern void free_pid(struct pid *pid);
|
| /*
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index 0150e90..7fb4e28 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -28,6 +28,7 @@
| #define CLONE_NEWPID 0x20000000 /* New pid namespace */
| #define CLONE_NEWNET 0x40000000 /* New network namespace */
| #define CLONE_IO 0x80000000 /* Clone io context */
| +#define CLONE_WITH_PID 0x00001000 /* Clone with pre-select PID */
|
| /*
| * Scheduling policies
| diff --git a/kernel/exit.c b/kernel/exit.c
| index 2d8be7e..4baf651 100644
| --- a/kernel/exit.c
| +++ b/kernel/exit.c
| @@ -3,7 +3,7 @@
| *
| * Copyright (C) 1991, 1992 Linus Torvalds
| */
| -
| +#define DEBUG
| #include <linux/mm.h>
| #include <linux/slab.h>
| #include <linux/interrupt.h>
| @@ -1676,6 +1676,7 @@ static long do_wait(enum pid_type type, struct pid *pid,
| DECLARE_WAITQUEUE(wait, current);
| struct task_struct *tsk;
| int retval;
| + int level;

and this (level is not used).
|
| trace_sched_process_wait(pid);
|
| @@ -1708,7 +1709,6 @@ repeat:
| retval = tsk_result;
| goto end;
| }
| -
| if (options & __WNOTHREAD)
| break;
| tsk = next_thread(tsk);
| @@ -1817,7 +1817,6 @@ asmlinkage long sys_wait4(pid_t upid, int __user *stat_a
| type = PIDTYPE_PID;
| pid = find_get_pid(upid);
| }
| -
| ret = do_wait(type, pid, options | WEXITED, NULL, stat_addr, ru);
| put_pid(pid);
|
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 085ce56..262ae1e 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -10,7 +10,7 @@
| * Fork is rather simple, once you get the hang of it, but the memory
| * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
| */
| -
| +#define DEBUG
| #include <linux/slab.h>
| #include <linux/init.h>
| #include <linux/unistd.h>
| @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
| int retval;
| struct task_struct *p;
| int cgroup_callbacks_done = 0;
| + pid_t clone_pid = stack_size;
|
| if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
| return ERR_PTR(-EINVAL);
|
| + /* We only allow the clone_with_pid when a new pid namespace is
| + * created. FIXME: how to restrict it.

Not sure why CLONE_NEWPID is required to set pid_nr. In fact with CLONE_NEWPID,
by definition, pid_nr should be 1. Also, what happens if a container has
more than one process - where the second process has a pid_nr > 2 ?

2009-03-13 06:20:08

by Ying Han

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Thank you Sukadev for your comments. I will try to clean up my patch
and repost it.

--Ying

On Thu, Mar 12, 2009 at 10:34 PM, Sukadev Bhattiprolu
<[email protected]> wrote:
> Ying Han [[email protected]] wrote:
> | Hi Serge:
> | I made a patch based on Oren's tree recently which implement a new
> | syscall clone_with_pid. I tested with checkpoint/restart process tree
> | and it works as expected.
>
> Yes, I think we had a version of clone() with pid a while ago.
>
> But it would be easier to review if you break it up into smaller
> patches. and remove the unnecessary diffs in this patch like...
>
>
> | This patch has some hack in it which i made a copy of libc's clone and
> | made modifications of passing one more argument(pid number). I will
> | try to clean up the code and do more testing.
> |
> | New syscall clone_with_pid
> | Implement a new syscall which clone a thread with a preselected pid number.
> |
> | clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
> | ? ? ? ? ? ? ? ? ? ? ? CLONE_WITH_PID|SIGCHLD, pid, NULL);
> |
> | Signed-off-by: Ying Han <[email protected]>
> |
> | diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
> | index 87803da..b5a1b03 100644
> | --- a/arch/x86/include/asm/syscalls.h
> | +++ b/arch/x86/include/asm/syscalls.h
> | @@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
> | ?asmlinkage int sys_clone(struct pt_regs);
> | ?asmlinkage int sys_vfork(struct pt_regs);
> | ?asmlinkage int sys_execve(struct pt_regs);
> | +asmlinkage int sys_clone_with_pid(struct pt_regs);
> |
> | ?/* kernel/signal_32.c */
> | ?asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
> | diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
> | index a5f9e09..f10ca0e 100644
> | --- a/arch/x86/include/asm/unistd_32.h
> | +++ b/arch/x86/include/asm/unistd_32.h
> | @@ -340,6 +340,7 @@
> | ?#define __NR_inotify_init1 ? 332
> | ?#define __NR_checkpoint ? ? ? ? ? ? ?333
> | ?#define __NR_restart ? ? ? ? 334
> | +#define __NR_clone_with_pid ?335
> |
> | ?#ifdef __KERNEL__
> |
> | diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> | index 0a1302f..88ae634 100644
> | --- a/arch/x86/kernel/process_32.c
> | +++ b/arch/x86/kernel/process_32.c
> | @@ -8,7 +8,6 @@
> | ?/*
> | ? * This file handles the architecture-dependent parts of process handling..
> | ? */
> | -
>
> these
>
> | ?#include <stdarg.h>
> |
> | ?#include <linux/cpu.h>
> | @@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
> | ? ? ? return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
> | ?}
> |
> | +/**
> | + * sys_clone_with_pid - clone a thread with pre-select pid number.
> | + */
> | +asmlinkage int sys_clone_with_pid(struct pt_regs regs)
> | +{
> | + ? ? unsigned long clone_flags;
> | + ? ? unsigned long newsp;
> | + ? ? int __user *parent_tidptr, *child_tidptr;
> | + ? ? pid_t pid_nr;
> | +
> | + ? ? clone_flags = regs.bx;
> | + ? ? newsp = regs.cx;
> | + ? ? parent_tidptr = (int __user *)regs.dx;
> | + ? ? child_tidptr = (int __user *)regs.di;
> | + ? ? pid_nr = regs.bp;
> | +
> | + ? ? if (!newsp)
> | + ? ? ? ? ? ? newsp = regs.sp;
> | + ? ? return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
> | + ? ? ? ? ? ? ? ? ? ? child_tidptr);
> | +}
> | +
> | ?/*
> | ? * This is trivial, and on the face of it looks like it
> | ? * could equally well be done in user mode.
> | diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_tabl
> | index 5543136..5191117 100644
> | --- a/arch/x86/kernel/syscall_table_32.S
> | +++ b/arch/x86/kernel/syscall_table_32.S
> | @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
> | ? ? ? .long sys_inotify_init1
> | ? ? ? .long sys_checkpoint
> | ? ? ? .long sys_restart
> | + ? ? .long sys_clone_with_pid
> | diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> | index 50bde9a..a4aee65 100644
> | --- a/arch/x86/mm/checkpoint.c
> | +++ b/arch/x86/mm/checkpoint.c
> | @@ -7,7 +7,6 @@
> | ? * ?License. ?See the file COPYING in the main directory of the Linux
> | ? * ?distribution for more details.
> | ? */
> | -
> | ?#include <asm/desc.h>
> | ?#include <asm/i387.h>
> |
> | diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> | index 64155de..b7de611 100644
> | --- a/checkpoint/checkpoint.c
> | +++ b/checkpoint/checkpoint.c
> | @@ -8,6 +8,7 @@
> | ? * ?distribution for more details.
> | ? */
> |
> | +#define DEBUG
> | ?#include <linux/version.h>
> | ?#include <linux/sched.h>
> | ?#include <linux/ptrace.h>
> | @@ -564,3 +565,4 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
> | ? out:
> | ? ? ? return ret;
> | ?}
> | +
> | diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
> | index e3097ac..a8c5ad5 100644
> | --- a/checkpoint/ckpt_file.c
> | +++ b/checkpoint/ckpt_file.c
> | @@ -7,7 +7,7 @@
> | ? * ?License. ?See the file COPYING in the main directory of the Linux
> | ? * ?distribution for more details.
> | ? */
> | -
> | +#define DEBUG
> | ?#include <linux/kernel.h>
> | ?#include <linux/sched.h>
> | ?#include <linux/file.h>
> | diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> | index 4925ff2..ca5840b 100644
> | --- a/checkpoint/ckpt_mem.c
> | +++ b/checkpoint/ckpt_mem.c
> | @@ -7,7 +7,7 @@
> | ? * ?License. ?See the file COPYING in the main directory of the Linux
> | ? * ?distribution for more details.
> | ? */
> | -
> | +#define DEBUG
> | ?#include <linux/kernel.h>
> | ?#include <linux/sched.h>
> | ?#include <linux/slab.h>
> | diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> | index 7ec4de4..30e43c2 100644
> | --- a/checkpoint/restart.c
> | +++ b/checkpoint/restart.c
> | @@ -8,6 +8,7 @@
> | ? * ?distribution for more details.
> | ? */
> |
> | +#define DEBUG
> | ?#include <linux/version.h>
> | ?#include <linux/sched.h>
> | ?#include <linux/wait.h>
> | @@ -242,7 +243,7 @@ static int cr_read_task_struct(struct cr_ctx *ctx)
> | ? ? ? ? ? ? ? memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
> | ? ? ? }
> | ? ? ? kfree(buf);
> | -
> | + ? ? pr_debug("read task %s\n", t->comm);
> | ? ? ? /* FIXME: restore remaining relevant task_struct fields */
> | ? out:
> | ? ? ? cr_hbuf_put(ctx, sizeof(*hh));
> | diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
> | index f44b081..755e40e 100644
> | --- a/checkpoint/rstr_file.c
> | +++ b/checkpoint/rstr_file.c
> | @@ -7,7 +7,7 @@
> | ? * ?License. ?See the file COPYING in the main directory of the Linux
> | ? * ?distribution for more details.
> | ? */
> | -
> | +#define DEBUG
> | ?#include <linux/kernel.h>
> | ?#include <linux/sched.h>
> | ?#include <linux/fs.h>
> | diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
> | index 4d5ce1a..8330468 100644
> | --- a/checkpoint/rstr_mem.c
> | +++ b/checkpoint/rstr_mem.c
> | @@ -7,7 +7,7 @@
> | ? * ?License. ?See the file COPYING in the main directory of the Linux
> | ? * ?distribution for more details.
> | ? */
> | -
> | +#define DEBUG
> | ?#include <linux/kernel.h>
> | ?#include <linux/sched.h>
> | ?#include <linux/fcntl.h>
> | diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> | index f26b0c6..d1a5394 100644
> | --- a/checkpoint/sys.c
> | +++ b/checkpoint/sys.c
> | @@ -7,7 +7,7 @@
> | ? * ?License. ?See the file COPYING in the main directory of the Linux
> | ? * ?distribution for more details.
> | ? */
> | -
> | +#define DEBUG
> | ?#include <linux/sched.h>
> | ?#include <linux/nsproxy.h>
> | ?#include <linux/kernel.h>
> | @@ -263,7 +263,6 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned
> | ? ? ? ? ? ? ? return PTR_ERR(ctx);
> |
> | ? ? ? ret = do_checkpoint(ctx, pid);
> | -
> | ? ? ? if (!ret)
> | ? ? ? ? ? ? ? ret = ctx->crid;
> |
> | @@ -304,3 +303,4 @@ asmlinkage long sys_restart(int crid, int fd, unsigned lon
> | ? ? ? cr_ctx_put(ctx);
> | ? ? ? return ret;
> | ?}
> | +
> | diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> | index 217cf6e..bc2c202 100644
> | --- a/include/linux/checkpoint.h
> | +++ b/include/linux/checkpoint.h
> | @@ -114,7 +114,6 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_
> | ?extern int do_restart(struct cr_ctx *ctx, pid_t pid);
> | ?extern int cr_read_mm(struct cr_ctx *ctx);
> | ?extern int cr_read_files(struct cr_ctx *ctx);
> | -
> | ?#ifdef pr_fmt
> | ?#undef pr_fmt
> | ?#endif
> | diff --git a/include/linux/pid.h b/include/linux/pid.h
> | index d7e98ff..86e2f61 100644
> | --- a/include/linux/pid.h
> | +++ b/include/linux/pid.h
> | @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
> | ?extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> | ?int next_pidmap(struct pid_namespace *pid_ns, int last);
> |
> | -extern struct pid *alloc_pid(struct pid_namespace *ns);
> | +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr);
> | ?extern void free_pid(struct pid *pid);
> |
> | ?/*
> | diff --git a/include/linux/sched.h b/include/linux/sched.h
> | index 0150e90..7fb4e28 100644
> | --- a/include/linux/sched.h
> | +++ b/include/linux/sched.h
> | @@ -28,6 +28,7 @@
> | ?#define CLONE_NEWPID ? ? ? ? 0x20000000 ? ? ?/* New pid namespace */
> | ?#define CLONE_NEWNET ? ? ? ? 0x40000000 ? ? ?/* New network namespace */
> | ?#define CLONE_IO ? ? ? ? ? ? 0x80000000 ? ? ?/* Clone io context */
> | +#define CLONE_WITH_PID ? ? ? ? ? ? ? 0x00001000 ? ? ?/* Clone with pre-select PID */
> |
> | ?/*
> | ? * Scheduling policies
> | diff --git a/kernel/exit.c b/kernel/exit.c
> | index 2d8be7e..4baf651 100644
> | --- a/kernel/exit.c
> | +++ b/kernel/exit.c
> | @@ -3,7 +3,7 @@
> | ? *
> | ? * ?Copyright (C) 1991, 1992 ?Linus Torvalds
> | ? */
> | -
> | +#define DEBUG
> | ?#include <linux/mm.h>
> | ?#include <linux/slab.h>
> | ?#include <linux/interrupt.h>
> | @@ -1676,6 +1676,7 @@ static long do_wait(enum pid_type type, struct pid *pid,
> | ? ? ? DECLARE_WAITQUEUE(wait, current);
> | ? ? ? struct task_struct *tsk;
> | ? ? ? int retval;
> | + ? ? int level;
>
> and this (level is not used).
> |
> | ? ? ? trace_sched_process_wait(pid);
> |
> | @@ -1708,7 +1709,6 @@ repeat:
> | ? ? ? ? ? ? ? ? ? ? ? retval = tsk_result;
> | ? ? ? ? ? ? ? ? ? ? ? goto end;
> | ? ? ? ? ? ? ? }
> | -
> | ? ? ? ? ? ? ? if (options & __WNOTHREAD)
> | ? ? ? ? ? ? ? ? ? ? ? break;
> | ? ? ? ? ? ? ? tsk = next_thread(tsk);
> | @@ -1817,7 +1817,6 @@ asmlinkage long sys_wait4(pid_t upid, int __user *stat_a
> | ? ? ? ? ? ? ? type = PIDTYPE_PID;
> | ? ? ? ? ? ? ? pid = find_get_pid(upid);
> | ? ? ? }
> | -
> | ? ? ? ret = do_wait(type, pid, options | WEXITED, NULL, stat_addr, ru);
> | ? ? ? put_pid(pid);
> |
> | diff --git a/kernel/fork.c b/kernel/fork.c
> | index 085ce56..262ae1e 100644
> | --- a/kernel/fork.c
> | +++ b/kernel/fork.c
> | @@ -10,7 +10,7 @@
> | ? * Fork is rather simple, once you get the hang of it, but the memory
> | ? * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
> | ? */
> | -
> | +#define DEBUG
> | ?#include <linux/slab.h>
> | ?#include <linux/init.h>
> | ?#include <linux/unistd.h>
> | @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
> | ? ? ? int retval;
> | ? ? ? struct task_struct *p;
> | ? ? ? int cgroup_callbacks_done = 0;
> | + ? ? pid_t clone_pid = stack_size;
> |
> | ? ? ? if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> | ? ? ? ? ? ? ? return ERR_PTR(-EINVAL);
> |
> | + ? ? /* We only allow the clone_with_pid when a new pid namespace is
> | + ? ? ?* created. FIXME: how to restrict it.
>
> Not sure why CLONE_NEWPID is required to set pid_nr. In fact with CLONE_NEWPID,
> by definition, pid_nr should be 1. Also, what happens if a container has
> more than one process - where the second process has a pid_nr > 2 ?
>

2009-03-13 15:27:50

by Cedric Le Goater

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Subject: [RFC] forkpid() syscall

From: Cedric Le Goater <[email protected]>

let's the user specify a pid to fork and return EBUSY if the pid is
not available.

this patch includes a alloc_pid*() cleanup on the way errors are
returned that could be pushed to mainline independently.

usage :

#include <sys/syscall.h>

#define __NR_forkpid 324

static inline int forkpid(int pid)
{
return syscall(__NR_forkpid, pid);
}

caveats :
fork oriented, should also cover clone
i386 only
does not cover 64 bits clone flags


Signed-off-by: Cedric Le Goater <[email protected]>
---
arch/i386/kernel/process.c | 15 +++++++++++----
arch/i386/kernel/syscall_table.S | 1 +
include/asm-i386/unistd.h | 3 ++-
include/linux/pid.h | 2 +-
include/linux/sched.h | 2 +-
kernel/fork.c | 9 +++++----
kernel/pid.c | 28 +++++++++++++++-------------
7 files changed, 36 insertions(+), 24 deletions(-)

Index: 2.6.22/kernel/pid.c
===================================================================
--- 2.6.22.orig/kernel/pid.c
+++ 2.6.22/kernel/pid.c
@@ -96,12 +96,12 @@ static fastcall void free_pidmap(struct
atomic_inc(&map->nr_free);
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
{
int i, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;

- pid = last + 1;
+ pid = upid ? upid : last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
offset = pid & BITS_PER_PAGE_MASK;
@@ -130,6 +130,8 @@ static int alloc_pidmap(struct pid_names
pid_ns->last_pid = pid;
return pid;
}
+ if (upid)
+ return -EBUSY;
offset = find_next_offset(map, offset);
pid = mk_pid(pid_ns, map, offset);
/*
@@ -153,7 +155,7 @@ static int alloc_pidmap(struct pid_names
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return -EAGAIN;
}

static int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -203,19 +205,24 @@ fastcall void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(void)
+struct pid *alloc_pid(pid_t upid)
{
struct pid *pid;
enum pid_type type;
int nr = -1;

pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL);
- if (!pid)
+ if (!pid) {
+ pid = ERR_PTR(-ENOMEM);
goto out;
+ }

- nr = alloc_pidmap(current->nsproxy->pid_ns);
- if (nr < 0)
- goto out_free;
+ nr = alloc_pidmap(current->nsproxy->pid_ns, upid);
+ if (nr < 0) {
+ kmem_cache_free(pid_cachep, pid);
+ pid = ERR_PTR(nr);
+ goto out;
+ }

atomic_set(&pid->count, 1);
pid->nr = nr;
@@ -228,11 +235,6 @@ struct pid *alloc_pid(void)

out:
return pid;
-
-out_free:
- kmem_cache_free(pid_cachep, pid);
- pid = NULL;
- goto out;
}

struct pid * fastcall find_pid(int nr)
Index: 2.6.22/arch/i386/kernel/process.c
===================================================================
--- 2.6.22.orig/arch/i386/kernel/process.c
+++ 2.6.22/arch/i386/kernel/process.c
@@ -355,7 +355,7 @@ int kernel_thread(int (*fn)(void *), voi
regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;

/* Ok, create the new process.. */
- return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
+ return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL, 0);
}
EXPORT_SYMBOL(kernel_thread);

@@ -722,9 +722,16 @@ struct task_struct fastcall * __switch_t
return prev_p;
}

+asmlinkage int sys_forkpid(struct pt_regs regs)
+{
+ pid_t pid = regs.ebx;
+
+ return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, pid);
+}
+
asmlinkage int sys_fork(struct pt_regs regs)
{
- return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
+ return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
}

asmlinkage int sys_clone(struct pt_regs regs)
@@ -739,7 +746,7 @@ asmlinkage int sys_clone(struct pt_regs
child_tidptr = (int __user *)regs.edi;
if (!newsp)
newsp = regs.esp;
- return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
+ return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr, 0);
}

/*
@@ -754,7 +761,7 @@ asmlinkage int sys_clone(struct pt_regs
*/
asmlinkage int sys_vfork(struct pt_regs regs)
{
- return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
+ return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
}

/*
Index: 2.6.22/arch/i386/kernel/syscall_table.S
===================================================================
--- 2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ 2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+ .long sys_forkpid
Index: 2.6.22/include/asm-i386/unistd.h
===================================================================
--- 2.6.22.orig/include/asm-i386/unistd.h
+++ 2.6.22/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
#define __NR_signalfd 321
#define __NR_timerfd 322
#define __NR_eventfd 323
+#define __NR_forkpid 324

#ifdef __KERNEL__

-#define NR_syscalls 324
+#define NR_syscalls 325

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
Index: 2.6.22/kernel/fork.c
===================================================================
--- 2.6.22.orig/kernel/fork.c
+++ 2.6.22/kernel/fork.c
@@ -1358,15 +1358,16 @@ long do_fork(unsigned long clone_flags,
struct pt_regs *regs,
unsigned long stack_size,
int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ pid_t upid)
{
struct task_struct *p;
int trace = 0;
- struct pid *pid = alloc_pid();
+ struct pid *pid = alloc_pid(upid);
long nr;

- if (!pid)
- return -EAGAIN;
+ if (IS_ERR(pid))
+ return PTR_ERR(pid);
nr = pid->nr;
if (unlikely(current->ptrace)) {
trace = fork_traceflag (clone_flags);
Index: 2.6.22/include/linux/sched.h
===================================================================
--- 2.6.22.orig/include/linux/sched.h
+++ 2.6.22/include/linux/sched.h
@@ -1433,7 +1433,7 @@ extern int allow_signal(int);
extern int disallow_signal(int);

extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
-extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t);
struct task_struct *fork_idle(int);

extern void set_task_comm(struct task_struct *tsk, char *from);
Index: 2.6.22/include/linux/pid.h
===================================================================
--- 2.6.22.orig/include/linux/pid.h
+++ 2.6.22/include/linux/pid.h
@@ -95,7 +95,7 @@ extern struct pid *FASTCALL(find_pid(int
extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr);

-extern struct pid *alloc_pid(void);
+extern struct pid *alloc_pid(pid_t upid);
extern void FASTCALL(free_pid(struct pid *pid));

static inline pid_t pid_nr(struct pid *pid)


Attachments:
clone_with_pid.patch (6.98 kB)

2009-03-13 15:47:50

by Cedric Le Goater

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?


> No, what you're suggesting does not suffice.

probably. I'm still trying to understand what you mean below :)

Man, I hate these hierarchicals pid_ns. one level would have been enough,
just one vpid attribute in 'struct pid*'

> Call
> (5591,3,1) the task knows as 5591 in the init_pid_ns, 3 in a child pid
> ns, and 1 in grandchild pid_ns created from there. Now assume we are
> checkpointing tasks T1=(5592,1), and T2=(5594,3,1).
>
> We don't care about the first number in the tuples, so they will be
> random numbers after the recreate.

yes.

> But we do care about the second numbers.

yes very much and we need a way set these numbers in alloc_pid()

> But specifying CLONE_NEWPID while recreating the process tree
> in userspace does not allow you to specify the 3 in (5594,3,1).

I haven't looked closely at hierarchical pid namespaces but as we're
using a an array of pid indexed but the pidns level, i don't see why
it shouldn't be possible. you might be right.

anyway, I think that some CLONE_NEW* should be forbidden. Daniel should
send soon a little patch for the ns_cgroup restricting the clone flags
being used in a container.

Cheers,

C.

> Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
>
> -serge

2009-03-13 16:36:28

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Quoting Cedric Le Goater ([email protected]):
>
> > No, what you're suggesting does not suffice.
>
> probably. I'm still trying to understand what you mean below :)
>
> Man, I hate these hierarchicals pid_ns. one level would have been enough,
> just one vpid attribute in 'struct pid*'

Well I don't mind - temporarily - saying that nested pid namespaces
are not checkpointable. It's just that if we're going to need a new
syscall anyway, then why not go ahead and address the whole problem?
It's not hugely more complicated, and seems worth it.

> > Call
> > (5591,3,1) the task knows as 5591 in the init_pid_ns, 3 in a child pid
> > ns, and 1 in grandchild pid_ns created from there. Now assume we are
> > checkpointing tasks T1=(5592,1), and T2=(5594,3,1).
> >
> > We don't care about the first number in the tuples, so they will be
> > random numbers after the recreate.
>
> yes.
>
> > But we do care about the second numbers.
>
> yes very much and we need a way set these numbers in alloc_pid()
>
> > But specifying CLONE_NEWPID while recreating the process tree
> > in userspace does not allow you to specify the 3 in (5594,3,1).
>
> I haven't looked closely at hierarchical pid namespaces but as we're
> using a an array of pid indexed but the pidns level, i don't see why
> it shouldn't be possible. you might be right.
>
> anyway, I think that some CLONE_NEW* should be forbidden. Daniel should
> send soon a little patch for the ns_cgroup restricting the clone flags
> being used in a container.

Uh, that feels a bit over the top. We want to make this
uncheckpointable (if it remains so), not prevent the whole action.
After all I may be running a container which I don't plan on ever
checkpointing, and inside that container running a job which i do
want to migrate.

So depending on if we're doing the Dave or the rest-of-the-world
way :), we either clear_bit(pidns->may_checkpoint) on the parent
pid_ns when a child is created, or we walk every task being
checkpointed and make sure they each are in the same pid_ns. Doesn't
that suffice?

-serge

2009-03-13 16:55:21

by Cedric Le Goater

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Serge E. Hallyn wrote:
> Quoting Cedric Le Goater ([email protected]):
>>> No, what you're suggesting does not suffice.
>> probably. I'm still trying to understand what you mean below :)
>>
>> Man, I hate these hierarchicals pid_ns. one level would have been enough,
>> just one vpid attribute in 'struct pid*'
>
> Well I don't mind - temporarily - saying that nested pid namespaces
> are not checkpointable. It's just that if we're going to need a new
> syscall anyway, then why not go ahead and address the whole problem?
> It's not hugely more complicated, and seems worth it.

yes. agree. there's a thread going on that topic. i'm following it.

[ ... ]

>> anyway, I think that some CLONE_NEW* should be forbidden. Daniel should
>> send soon a little patch for the ns_cgroup restricting the clone flags
>> being used in a container.
>
> Uh, that feels a bit over the top. We want to make this
> uncheckpointable (if it remains so), not prevent the whole action.
> After all I may be running a container which I don't plan on ever
> checkpointing, and inside that container running a job which i do
> want to migrate.

ok. i've been scanning the emails a bit fast. that would be fine
and useful.

> So depending on if we're doing the Dave or the rest-of-the-world
> way :), we either clear_bit(pidns->may_checkpoint) on the parent
> pid_ns when a child is created, or we walk every task being
> checkpointed and make sure they each are in the same pid_ns.
> Doesn't that suffice?

yes. this 'may_checkpoint' is a container level info so I wonder
where you store it. in a cgroup_checkpoint ? sorry for jumping in
and may be restarting some old topics of discussion.

C.

2009-03-13 17:12:50

by Greg Kurz

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

On Fri, 2009-03-13 at 16:27 +0100, Cedric Le Goater wrote:
> Ying Han wrote:
> > Hi Serge:
> > I made a patch based on Oren's tree recently which implement a new
> > syscall clone_with_pid. I tested with checkpoint/restart process tree
> > and it works as expected.
> > This patch has some hack in it which i made a copy of libc's clone and
> > made modifications of passing one more argument(pid number). I will
> > try to clean up the code and do more testing.
>
> ok. 2 patches would also be interesting. one for the syscall and one
> for the kernel support (which might be acceptable)
>
> > New syscall clone_with_pid
> > Implement a new syscall which clone a thread with a preselected pid number.
>
> yes this definitely needed to restart a task/thread. we maintain an ugly
> hack which stores a pid in the current task that will be used by the next
> clone() call.
>

That's probably better as you say... but damned, sys_clone() is arch
dependant so much files to patch. :)

> > clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
> > CLONE_WITH_PID|SIGCHLD, pid, NULL);
>
> since you're introducing a new syscall, I don't see why you need a
> CLONE_WITH_PID flag ?
>
> (FYI, attached is my old attempt of clone_with_pid but that's too old)
>
> [ ... ]
>
> > +#define DEBUG
> > #include <linux/slab.h>
> > #include <linux/init.h>
> > #include <linux/unistd.h>
> > @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
> > int retval;
> > struct task_struct *p;
> > int cgroup_callbacks_done = 0;
> > + pid_t clone_pid = stack_size;
> >
> > if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> > return ERR_PTR(-EINVAL);
> >
> > + /* We only allow the clone_with_pid when a new pid namespace is
> > + * created. FIXME: how to restrict it.
> > + */
> > + if ((clone_flags & CLONE_NEWPID) && (clone_flags & CLONE_WITH_PID))
> > + return ERR_PTR(-EINVAL);
> > + if ((clone_flags & CLONE_WITH_PID) && (clone_pid <= 1))
> > + return ERR_PTR(-EINVAL);
>
> I would let alloc_pid() handle the error.
>
> > /*
> > * Thread groups must share signals as well, and detached threads
> > * can only be started up within the thread group.
> > @@ -1135,7 +1144,10 @@ static struct task_struct *copy_process(unsigned long c
> >
> > if (pid != &init_struct_pid) {
> > retval = -ENOMEM;
> > - pid = alloc_pid(task_active_pid_ns(p));
> > + if (clone_flags & CLONE_WITH_PID)
> > + pid = alloc_pid(task_active_pid_ns(p), clone_pid);
> > + else
> > + pid = alloc_pid(task_active_pid_ns(p), 0);
>
> this is overkill IMO.
>
> [ ... ]
>
> > -static int alloc_pidmap(struct pid_namespace *pid_ns)
> > +static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t pid_nr)
> > {
> > int i, offset, max_scan, pid, last = pid_ns->last_pid;
> > struct pidmap *map;
> >
> > - pid = last + 1;
> > + if (pid_nr)
> > + pid = pid_nr;
> > + else
> > + pid = last + 1;
> >
> > if (pid >= pid_max)
> > pid = RESERVED_PIDS;
> > offset = pid & BITS_PER_PAGE_MASK;
> > @@ -153,9 +156,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> > do {
> > if (!test_and_set_bit(offset, map->page)) {
> > atomic_dec(&map->nr_free);
> > - pid_ns->last_pid = pid;
> > + if (!pid_nr)
> > + pid_ns->last_pid = pid;
> > return pid;
> > }
> > + if (pid_nr)
> > + return -1;
> > offset = find_next_offset(map, offset);
> > pid = mk_pid(pid_ns, map, offset);
> > /*
> > @@ -239,21 +245,25 @@ void free_pid(struct pid *pid)
> > call_rcu(&pid->rcu, delayed_put_pid);
> > }
> >
> > -struct pid *alloc_pid(struct pid_namespace *ns)
> > +struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr)
> > {
> > struct pid *pid;
> > enum pid_type type;
> > int i, nr;
> > struct pid_namespace *tmp;
> > struct upid *upid;
> > + int level = ns->level;
> > +
> > + if (pid_nr >= pid_max)
> > + return NULL;
>
> let alloc_pidmap() handle it ?
>
> >
> > pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> > if (!pid)
> > goto out;
> >
> > - tmp = ns;
> > - for (i = ns->level; i >= 0; i--) {
> > - nr = alloc_pidmap(tmp);
> > + tmp = ns->parent;
> > + for (i = level-1; i >= 0; i--) {
> > + nr = alloc_pidmap(tmp, 0);
> > if (nr < 0)
> > goto out_free;
> >
> > @@ -262,6 +272,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > tmp = tmp->parent;
> > }
> >
> > + nr = alloc_pidmap(ns, pid_nr);
> > + if (nr < 0)
> > + goto out_free;
> > + pid->numbers[level].nr = nr;
> > + pid->numbers[level].ns = ns;
> > + if (nr == pid_nr)
> > + pr_debug("nr == pid_nr == %d\n", nr);
> > +
> > get_pid_ns(ns);
> > pid->level = ns->level;
> > atomic_set(&pid->count, 1);
> >
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 12, 2009 at 2:21 PM, Serge E. Hallyn <[email protected]> wrote:
> >> Quoting Greg Kurz ([email protected]):
> >>> On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> >>>> Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> >>>> the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> >>>>
> >>> Of course not
> >> Ok - someone *did* argue that at some point I think...
> >>
> >>> but one should be able to tell clone() to pick a specific
> >>> pid.
> >> Can you explain exactly how? I must be missing something clever.
> >>
> >> -serge
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to [email protected]. For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
> plain text document attachment (clone_with_pid.patch)
> Subject: [RFC] forkpid() syscall
>
> From: Cedric Le Goater <[email protected]>
>
> let's the user specify a pid to fork and return EBUSY if the pid is
> not available.
>
> this patch includes a alloc_pid*() cleanup on the way errors are
> returned that could be pushed to mainline independently.
>
> usage :
>
> #include <sys/syscall.h>
>
> #define __NR_forkpid 324
>
> static inline int forkpid(int pid)
> {
> return syscall(__NR_forkpid, pid);
> }
>
> caveats :
> fork oriented, should also cover clone
> i386 only
> does not cover 64 bits clone flags
>
>
> Signed-off-by: Cedric Le Goater <[email protected]>
> ---
> arch/i386/kernel/process.c | 15 +++++++++++----
> arch/i386/kernel/syscall_table.S | 1 +
> include/asm-i386/unistd.h | 3 ++-
> include/linux/pid.h | 2 +-
> include/linux/sched.h | 2 +-
> kernel/fork.c | 9 +++++----
> kernel/pid.c | 28 +++++++++++++++-------------
> 7 files changed, 36 insertions(+), 24 deletions(-)
>
> Index: 2.6.22/kernel/pid.c
> ===================================================================
> --- 2.6.22.orig/kernel/pid.c
> +++ 2.6.22/kernel/pid.c
> @@ -96,12 +96,12 @@ static fastcall void free_pidmap(struct
> atomic_inc(&map->nr_free);
> }
>
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
> {
> int i, offset, max_scan, pid, last = pid_ns->last_pid;
> struct pidmap *map;
>
> - pid = last + 1;
> + pid = upid ? upid : last + 1;
> if (pid >= pid_max)
> pid = RESERVED_PIDS;
> offset = pid & BITS_PER_PAGE_MASK;
> @@ -130,6 +130,8 @@ static int alloc_pidmap(struct pid_names
> pid_ns->last_pid = pid;
> return pid;
> }
> + if (upid)
> + return -EBUSY;
> offset = find_next_offset(map, offset);
> pid = mk_pid(pid_ns, map, offset);
> /*
> @@ -153,7 +155,7 @@ static int alloc_pidmap(struct pid_names
> }
> pid = mk_pid(pid_ns, map, offset);
> }
> - return -1;
> + return -EAGAIN;
> }
>
> static int next_pidmap(struct pid_namespace *pid_ns, int last)
> @@ -203,19 +205,24 @@ fastcall void free_pid(struct pid *pid)
> call_rcu(&pid->rcu, delayed_put_pid);
> }
>
> -struct pid *alloc_pid(void)
> +struct pid *alloc_pid(pid_t upid)
> {
> struct pid *pid;
> enum pid_type type;
> int nr = -1;
>
> pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL);
> - if (!pid)
> + if (!pid) {
> + pid = ERR_PTR(-ENOMEM);
> goto out;
> + }
>
> - nr = alloc_pidmap(current->nsproxy->pid_ns);
> - if (nr < 0)
> - goto out_free;
> + nr = alloc_pidmap(current->nsproxy->pid_ns, upid);
> + if (nr < 0) {
> + kmem_cache_free(pid_cachep, pid);
> + pid = ERR_PTR(nr);
> + goto out;
> + }
>
> atomic_set(&pid->count, 1);
> pid->nr = nr;
> @@ -228,11 +235,6 @@ struct pid *alloc_pid(void)
>
> out:
> return pid;
> -
> -out_free:
> - kmem_cache_free(pid_cachep, pid);
> - pid = NULL;
> - goto out;
> }
>
> struct pid * fastcall find_pid(int nr)
> Index: 2.6.22/arch/i386/kernel/process.c
> ===================================================================
> --- 2.6.22.orig/arch/i386/kernel/process.c
> +++ 2.6.22/arch/i386/kernel/process.c
> @@ -355,7 +355,7 @@ int kernel_thread(int (*fn)(void *), voi
> regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
>
> /* Ok, create the new process.. */
> - return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
> + return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL, 0);
> }
> EXPORT_SYMBOL(kernel_thread);
>
> @@ -722,9 +722,16 @@ struct task_struct fastcall * __switch_t
> return prev_p;
> }
>
> +asmlinkage int sys_forkpid(struct pt_regs regs)
> +{
> + pid_t pid = regs.ebx;
> +
> + return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, pid);
> +}
> +
> asmlinkage int sys_fork(struct pt_regs regs)
> {
> - return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
> + return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
> }
>
> asmlinkage int sys_clone(struct pt_regs regs)
> @@ -739,7 +746,7 @@ asmlinkage int sys_clone(struct pt_regs
> child_tidptr = (int __user *)regs.edi;
> if (!newsp)
> newsp = regs.esp;
> - return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
> + return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr, 0);
> }
>
> /*
> @@ -754,7 +761,7 @@ asmlinkage int sys_clone(struct pt_regs
> */
> asmlinkage int sys_vfork(struct pt_regs regs)
> {
> - return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
> + return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
> }
>
> /*
> Index: 2.6.22/arch/i386/kernel/syscall_table.S
> ===================================================================
> --- 2.6.22.orig/arch/i386/kernel/syscall_table.S
> +++ 2.6.22/arch/i386/kernel/syscall_table.S
> @@ -323,3 +323,4 @@ ENTRY(sys_call_table)
> .long sys_signalfd
> .long sys_timerfd
> .long sys_eventfd
> + .long sys_forkpid
> Index: 2.6.22/include/asm-i386/unistd.h
> ===================================================================
> --- 2.6.22.orig/include/asm-i386/unistd.h
> +++ 2.6.22/include/asm-i386/unistd.h
> @@ -329,10 +329,11 @@
> #define __NR_signalfd 321
> #define __NR_timerfd 322
> #define __NR_eventfd 323
> +#define __NR_forkpid 324
>
> #ifdef __KERNEL__
>
> -#define NR_syscalls 324
> +#define NR_syscalls 325
>
> #define __ARCH_WANT_IPC_PARSE_VERSION
> #define __ARCH_WANT_OLD_READDIR
> Index: 2.6.22/kernel/fork.c
> ===================================================================
> --- 2.6.22.orig/kernel/fork.c
> +++ 2.6.22/kernel/fork.c
> @@ -1358,15 +1358,16 @@ long do_fork(unsigned long clone_flags,
> struct pt_regs *regs,
> unsigned long stack_size,
> int __user *parent_tidptr,
> - int __user *child_tidptr)
> + int __user *child_tidptr,
> + pid_t upid)
> {
> struct task_struct *p;
> int trace = 0;
> - struct pid *pid = alloc_pid();
> + struct pid *pid = alloc_pid(upid);
> long nr;
>
> - if (!pid)
> - return -EAGAIN;
> + if (IS_ERR(pid))
> + return PTR_ERR(pid);
> nr = pid->nr;
> if (unlikely(current->ptrace)) {
> trace = fork_traceflag (clone_flags);
> Index: 2.6.22/include/linux/sched.h
> ===================================================================
> --- 2.6.22.orig/include/linux/sched.h
> +++ 2.6.22/include/linux/sched.h
> @@ -1433,7 +1433,7 @@ extern int allow_signal(int);
> extern int disallow_signal(int);
>
> extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
> -extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
> +extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t);
> struct task_struct *fork_idle(int);
>
> extern void set_task_comm(struct task_struct *tsk, char *from);
> Index: 2.6.22/include/linux/pid.h
> ===================================================================
> --- 2.6.22.orig/include/linux/pid.h
> +++ 2.6.22/include/linux/pid.h
> @@ -95,7 +95,7 @@ extern struct pid *FASTCALL(find_pid(int
> extern struct pid *find_get_pid(int nr);
> extern struct pid *find_ge_pid(int nr);
>
> -extern struct pid *alloc_pid(void);
> +extern struct pid *alloc_pid(pid_t upid);
> extern void FASTCALL(free_pid(struct pid *pid));
>
> static inline pid_t pid_nr(struct pid *pid)
--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2009-03-13 17:37:47

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Quoting Ying Han ([email protected]):
> Hi Serge:
> I made a patch based on Oren's tree recently which implement a new
> syscall clone_with_pid. I tested with checkpoint/restart process tree
> and it works as expected.
> This patch has some hack in it which i made a copy of libc's clone and
> made modifications of passing one more argument(pid number). I will
> try to clean up the code and do more testing.
>
> New syscall clone_with_pid
> Implement a new syscall which clone a thread with a preselected pid number.
>
> clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
> CLONE_WITH_PID|SIGCHLD, pid, NULL);
>
> Signed-off-by: Ying Han <[email protected]>
>
> diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
> index 87803da..b5a1b03 100644
> --- a/arch/x86/include/asm/syscalls.h
> +++ b/arch/x86/include/asm/syscalls.h
> @@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
> asmlinkage int sys_clone(struct pt_regs);
> asmlinkage int sys_vfork(struct pt_regs);
> asmlinkage int sys_execve(struct pt_regs);
> +asmlinkage int sys_clone_with_pid(struct pt_regs);
>
> /* kernel/signal_32.c */
> asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
> index a5f9e09..f10ca0e 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -340,6 +340,7 @@
> #define __NR_inotify_init1 332
> #define __NR_checkpoint 333
> #define __NR_restart 334
> +#define __NR_clone_with_pid 335
>
> #ifdef __KERNEL__
>
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 0a1302f..88ae634 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -8,7 +8,6 @@
> /*
> * This file handles the architecture-dependent parts of process handling..
> */
> -
> #include <stdarg.h>
>
> #include <linux/cpu.h>
> @@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
> return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
> }
>
> +/**
> + * sys_clone_with_pid - clone a thread with pre-select pid number.
> + */
> +asmlinkage int sys_clone_with_pid(struct pt_regs regs)
> +{
> + unsigned long clone_flags;
> + unsigned long newsp;
> + int __user *parent_tidptr, *child_tidptr;
> + pid_t pid_nr;
> +
> + clone_flags = regs.bx;
> + newsp = regs.cx;
> + parent_tidptr = (int __user *)regs.dx;
> + child_tidptr = (int __user *)regs.di;
> + pid_nr = regs.bp;

Hi,

Thanks for looking at this. I appreciate the patch. Two comments
however.

As I was saying in another email, i think that so long as we are going
with a new syscall, we should make sure that it suffices for nested
pid namespaces. So send in an array of pids and its lengths, then
use an algorithm like Alexey's to fill in the sent-in pids if possible.

> + if (!newsp)
> + newsp = regs.sp;
> + return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
> + child_tidptr);
> +}
> +
> /*
> * This is trivial, and on the face of it looks like it
> * could equally well be done in user mode.

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 085ce56..262ae1e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -10,7 +10,7 @@
> * Fork is rather simple, once you get the hang of it, but the memory
> * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
> */
> -
> +#define DEBUG
> #include <linux/slab.h>
> #include <linux/init.h>
> #include <linux/unistd.h>
> @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
> int retval;
> struct task_struct *p;
> int cgroup_callbacks_done = 0;
> + pid_t clone_pid = stack_size;

Note that some architectures (i.e. ia64) actually use the stack_size
sent to copy_thread, so you at least need to zero out stack_size here.

And I suspect there are cases where a stack_size is actually sent in,
so this doesn't seem legitimate, but I haven't tracked down the callers.
If you tell me you think there should be no case where a real stack_size
is sent in, well, I'll feel better if you prove it by breaking the patch
up so that:

1. you remove the stack_size argument from copy_process. Test such a
kernel on some architectures (boot+ltp, for instance).

2. add a chosen_pid argument to copy_process.

Then we can be sure noone is using the field.

thanks,
-serge

2009-03-13 17:39:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?



On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:

> Ying Han [[email protected]] wrote:
> | Hi Serge:
> | I made a patch based on Oren's tree recently which implement a new
> | syscall clone_with_pid. I tested with checkpoint/restart process tree
> | and it works as expected.
>
> Yes, I think we had a version of clone() with pid a while ago.

Are people _at_all_ thinking about security?

Obviously not.

There's no way we can do anything like this. Sure, it's trivial to do
inside the kernel. But it also sounds like a _wonderful_ attack vector
against badly written user-land software that sends signals and has small
races.

Quite frankly, from having followed the discussion(s) over the last few
weeks about checkpoint/restart in various forms, my reaction to just about
_all_ of this is that people pushing this are pretty damn borderline.

I think you guys are working on all the wrong problems.

Let's face it, we're not going to _ever_ checkpoint any kind of general
case process. Just TCP makes that fundamentally impossible in the general
case, and there are lots and lots of other cases too (just something as
totally _trivial_ as all the files in the filesystem that don't get rolled
back).

So unless people start realizing that
(a) processes that want to be checkpointed had better be ready and aware
of it, and help out
(b) there's no way in hell that we're going to add these kinds of
interfaces that have dubious upsides (just teach the damn program
you're checkpointing that pids will change, and admit to everybody
that people who want to be checkpointed need to do work) and are
potential security holes.
(c) if you are going to play any deeper games, you need to have
privileges. IOW, "clone_with_pid()" is ok for _root_, but not for
some random user. And you'd better keep that in mind EVERY SINGLE
STEP OF THE WAY.

I'm really fed up with these discussions. I have seen almost _zero_
critical thinking at all. Probably because anybody who is in the least
doubtful about it simply has tuned out the discussion. So here's my input:
start small, start over, and start thinking about other issues than just
checkpointing.

Linus

2009-03-13 19:03:24

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Quoting Linus Torvalds ([email protected]):
>
>
> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>
> > Ying Han [[email protected]] wrote:
> > | Hi Serge:
> > | I made a patch based on Oren's tree recently which implement a new
> > | syscall clone_with_pid. I tested with checkpoint/restart process tree
> > | and it works as expected.
> >
> > Yes, I think we had a version of clone() with pid a while ago.
>
> Are people _at_all_ thinking about security?
>
> Obviously not.
>
> There's no way we can do anything like this. Sure, it's trivial to do
> inside the kernel. But it also sounds like a _wonderful_ attack vector
> against badly written user-land software that sends signals and has small
> races.

If we're worried about that, one way we could address it is to tag a
pid_ns with the userid of whoever created it, and enforce that you
can only specify a pid in a pid_ns which you own.

What openvz does is have sys_restart create the whole process tree,
including custom pids (only in the new private namespaces), from inside
the kernel. That has the same effect of only allowing specification of
pids in your own private pid namespaces.

> Quite frankly, from having followed the discussion(s) over the last few
> weeks about checkpoint/restart in various forms, my reaction to just about
> _all_ of this is that people pushing this are pretty damn borderline.
>
> I think you guys are working on all the wrong problems.
>
> Let's face it, we're not going to _ever_ checkpoint any kind of general
> case process. Just TCP makes that fundamentally impossible in the general
> case, and there are lots and lots of other cases too (just something as
> totally _trivial_ as all the files in the filesystem that don't get rolled
> back).

I'm pretty sure that each of Openvz, Metacluster, and Zap are able to
checkpoint, restart, and migrate tasks which are actively using TCP.
Of course doing rollback of one endpoint of an ongoing communication
would be silly, but migration is possible.

> So unless people start realizing that
> (a) processes that want to be checkpointed had better be ready and aware
> of it, and help out
> (b) there's no way in hell that we're going to add these kinds of
> interfaces that have dubious upsides (just teach the damn program
> you're checkpointing that pids will change, and admit to everybody
> that people who want to be checkpointed need to do work) and are
> potential security holes.
> (c) if you are going to play any deeper games, you need to have
> privileges. IOW, "clone_with_pid()" is ok for _root_, but not for
> some random user. And you'd better keep that in mind EVERY SINGLE
> STEP OF THE WAY.

Yes, that is why we're keeping from requiring privilege so far - to make
sure that at each step we have to consider security requirements.

> I'm really fed up with these discussions. I have seen almost _zero_
> critical thinking at all. Probably because anybody who is in the least
> doubtful about it simply has tuned out the discussion. So here's my input:
> start small, start over, and start thinking about other issues than just
> checkpointing.

The first set of patches from Oren is intended to do just that. It
certainly did not have any sort of clone_with_pid() equivalent.

-serge

2009-03-13 19:28:17

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

On Fri, Mar 13, 2009 at 10:27:54AM -0700, Linus Torvalds wrote:
> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>
> > Ying Han [[email protected]] wrote:
> > | Hi Serge:
> > | I made a patch based on Oren's tree recently which implement a new
> > | syscall clone_with_pid. I tested with checkpoint/restart process tree
> > | and it works as expected.
> >
> > Yes, I think we had a version of clone() with pid a while ago.
>
> Are people _at_all_ thinking about security?
>
> Obviously not.

For the record, OpenVZ always have CAP_SYS_ADMIN check on restore.
And CAP_SYS_ADMIN will be in version to be sent out.

Not having it is one big security hole.

> There's no way we can do anything like this. Sure, it's trivial to do
> inside the kernel. But it also sounds like a _wonderful_ attack vector
> against badly written user-land software that sends signals and has small
> races.
>
> Quite frankly, from having followed the discussion(s) over the last few
> weeks about checkpoint/restart in various forms, my reaction to just about
> _all_ of this is that people pushing this are pretty damn borderline.
>
> I think you guys are working on all the wrong problems.
>
> Let's face it, we're not going to _ever_ checkpoint any kind of general
> case process. Just TCP makes that fundamentally impossible in the general
> case, and there are lots and lots of other cases too (just something as
> totally _trivial_ as all the files in the filesystem that don't get rolled
> back).

What do you mean here? Unlinked files?

> So unless people start realizing that
> (a) processes that want to be checkpointed had better be ready and aware
> of it, and help out

This is not going to happen. Userspace authors won't do anything
(nor they shouldn't).

> (b) there's no way in hell that we're going to add these kinds of
> interfaces that have dubious upsides (just teach the damn program
> you're checkpointing that pids will change, and admit to everybody
> that people who want to be checkpointed need to do work) and are
> potential security holes.

I personally don't understand why on earth clone_with_pid() is again
with us.

As if pids are somehow unique among other resources.

It was discussed when IPC objects creation with specific parameters were
discussed.

"struct pid" and "struct pid_namespace" can be trivially restored
without leaking to userspace.

People probably assume that task should be restored with clone(2) which
is unnatural given relations between task_struct, nsproxy and individual
struct foo_namespace's

> (c) if you are going to play any deeper games, you need to have
> privileges. IOW, "clone_with_pid()" is ok for _root_, but not for
> some random user. And you'd better keep that in mind EVERY SINGLE
> STEP OF THE WAY.
>
> I'm really fed up with these discussions. I have seen almost _zero_
> critical thinking at all. Probably because anybody who is in the least
> doubtful about it simply has tuned out the discussion. So here's my input:
> start small, start over, and start thinking about other issues than just
> checkpointing.

2009-03-13 20:52:12

by Mike Waychison

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Linus Torvalds wrote:
>
> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>
>> Ying Han [[email protected]] wrote:
>> | Hi Serge:
>> | I made a patch based on Oren's tree recently which implement a new
>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>> | and it works as expected.
>>
>> Yes, I think we had a version of clone() with pid a while ago.
>
> Are people _at_all_ thinking about security?
>
> Obviously not.
>
> There's no way we can do anything like this. Sure, it's trivial to do
> inside the kernel. But it also sounds like a _wonderful_ attack vector
> against badly written user-land software that sends signals and has small
> races.

I'm not really sure how this is different than a malicious app going off
and spawning thousands of threads in an attempt to hit a target pid from
a security pov. Sure, it makes it easier, but it's not like there is
anything in place to close the attack vector.

>
> Quite frankly, from having followed the discussion(s) over the last few
> weeks about checkpoint/restart in various forms, my reaction to just about
> _all_ of this is that people pushing this are pretty damn borderline.
>
> I think you guys are working on all the wrong problems.
>
> Let's face it, we're not going to _ever_ checkpoint any kind of general
> case process. Just TCP makes that fundamentally impossible in the general
> case, and there are lots and lots of other cases too (just something as
> totally _trivial_ as all the files in the filesystem that don't get rolled
> back).

In some instances such as ours, TCP is probably the easiest thing to
migrate. In an rpc-based cluster application, TCP is nothing more than
an RPC channel and applications already have to handle RPC channel
failure and re-establishment.

I agree that this is not the 'general case' as you mention above
however. This is the bit that sorta bothers me with the way the
implementation has been going so far on this list. The implementation
that folks are building on top of Oren's patchset tries to be everything
to everybody. For our purposes, we need to have the flexibility of
choosing *how* we checkpoint. The line seems to be arbitrarily drawn at
the kernel being responsible for checkpointing and restoring all
resources associated with a task, and leaving userland with nothing more
than transporting filesystem bits. This approach isn't flexible enough:
Consider the case where we want to stub out most of the TCP file
descriptors with ECONNRESETed sockets because we know that they are RPC
sockets and can re-establish themselves, but we want to use some other
mechanism for TCP sockets we don't know much about. The current
monolithic approach has zero flexibility for doing anything like this,
and I figure out how we could even fit anything like this in.

This sort of problem is pushing me to wanting all this stuff to live up
in userland. The 'core dump'ish way of checkpointing is a great way to
prototype some of the requirements, but it's going to end up being
pretty difficult to do anything interesting long term and this is going
to stifle any chance of this getting productized in our environments.

>
> So unless people start realizing that
> (a) processes that want to be checkpointed had better be ready and aware
> of it, and help out
> (b) there's no way in hell that we're going to add these kinds of
> interfaces that have dubious upsides (just teach the damn program
> you're checkpointing that pids will change, and admit to everybody
> that people who want to be checkpointed need to do work) and are
> potential security holes.

This is a bit ridiculous. This is akin to asking programs to recognize
that their heap addresses may change.

> (c) if you are going to play any deeper games, you need to have
> privileges. IOW, "clone_with_pid()" is ok for _root_, but not for
> some random user. And you'd better keep that in mind EVERY SINGLE
> STEP OF THE WAY.
>
> I'm really fed up with these discussions. I have seen almost _zero_
> critical thinking at all. Probably because anybody who is in the least
> doubtful about it simply has tuned out the discussion. So here's my input:
> start small, start over, and start thinking about other issues than just
> checkpointing.
>
> Linus
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

2009-03-13 21:11:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?



On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> >
> > Let's face it, we're not going to _ever_ checkpoint any kind of general
> > case process. Just TCP makes that fundamentally impossible in the general
> > case, and there are lots and lots of other cases too (just something as
> > totally _trivial_ as all the files in the filesystem that don't get rolled
> > back).
>
> What do you mean here? Unlinked files?

Or modified files, or anything else. "External state" is a pretty damn
wide net. It's not just TCP sequence numbers and another machine.

Linus

2009-03-13 21:51:55

by Dave Hansen

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

On Fri, 2009-03-13 at 14:01 -0700, Linus Torvalds wrote:
> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> > > Let's face it, we're not going to _ever_ checkpoint any kind of general
> > > case process. Just TCP makes that fundamentally impossible in the general
> > > case, and there are lots and lots of other cases too (just something as
> > > totally _trivial_ as all the files in the filesystem that don't get rolled
> > > back).
> >
> > What do you mean here? Unlinked files?
>
> Or modified files, or anything else. "External state" is a pretty damn
> wide net. It's not just TCP sequence numbers and another machine.

This is precisely the reason that we've focused so hard on containers,
and *didn't* just jump right into checkpoint/restart; we're trying
really hard to constrain the _truly_ external things that a process can
interact with.

The approach so far has largely been to make things are external to a
process at least *internal* to a container. Network, pid, ipc, and uts
namespaces, for example. An ipc/sem.c semaphore may be external to a
process, so we'll just pick the whole namespace up and checkpoint it
along with the process.

In the OpenVZ case, they've at least demonstrated that the filesystem
can be moved largely with rsync. Unlinked files need some in-kernel TLC
(or /proc mangling) but it isn't *that* bad.

We can also make the fs problem much easier by using things like dm or
btrfs snapshotting of the block device, or restricting to where on a fs
a container is allowed to write with stuff like r/o bind mounts.

-- Dave

2009-03-13 22:17:39

by Oren Laadan

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?



Dave Hansen wrote:
> On Fri, 2009-03-13 at 14:01 -0700, Linus Torvalds wrote:
>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of general
>>>> case process. Just TCP makes that fundamentally impossible in the general
>>>> case, and there are lots and lots of other cases too (just something as
>>>> totally _trivial_ as all the files in the filesystem that don't get rolled
>>>> back).
>>> What do you mean here? Unlinked files?
>> Or modified files, or anything else. "External state" is a pretty damn
>> wide net. It's not just TCP sequence numbers and another machine.
>
> This is precisely the reason that we've focused so hard on containers,
> and *didn't* just jump right into checkpoint/restart; we're trying
> really hard to constrain the _truly_ external things that a process can
> interact with.
>
> The approach so far has largely been to make things are external to a
> process at least *internal* to a container. Network, pid, ipc, and uts
> namespaces, for example. An ipc/sem.c semaphore may be external to a
> process, so we'll just pick the whole namespace up and checkpoint it
> along with the process.
>
> In the OpenVZ case, they've at least demonstrated that the filesystem
> can be moved largely with rsync. Unlinked files need some in-kernel TLC
> (or /proc mangling) but it isn't *that* bad.

And in the Zap we have successfully used a log-based filesystem
(specifically NILFS) to continuously snapshot the file-system atomically
with taking a checkpoint, so it can easily branch off past checkpoints,
including the file system.

And unlinked files can be (inefficiently) handled by saving their full
contents with the checkpoint image - it's not a big toll on many apps
(if you exclude Wine and UML...). At least that's a start.

>
> We can also make the fs problem much easier by using things like dm or
> btrfs snapshotting of the block device, or restricting to where on a fs
> a container is allowed to write with stuff like r/o bind mounts.

(or NILFS)

So we argue that the FS snapshotting is related, but orthogonal in terms
of implementation to c/r.

Oren.

2009-03-13 22:39:49

by Oren Laadan

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?



Mike Waychison wrote:
> Linus Torvalds wrote:
>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>
>>> Ying Han [[email protected]] wrote:
>>> | Hi Serge:
>>> | I made a patch based on Oren's tree recently which implement a new
>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>> | and it works as expected.
>>>
>>> Yes, I think we had a version of clone() with pid a while ago.
>> Are people _at_all_ thinking about security?
>>
>> Obviously not.
>>
>> There's no way we can do anything like this. Sure, it's trivial to do
>> inside the kernel. But it also sounds like a _wonderful_ attack vector
>> against badly written user-land software that sends signals and has small
>> races.
>
> I'm not really sure how this is different than a malicious app going off
> and spawning thousands of threads in an attempt to hit a target pid from
> a security pov. Sure, it makes it easier, but it's not like there is
> anything in place to close the attack vector.
>
>> Quite frankly, from having followed the discussion(s) over the last few
>> weeks about checkpoint/restart in various forms, my reaction to just about
>> _all_ of this is that people pushing this are pretty damn borderline.
>>
>> I think you guys are working on all the wrong problems.
>>
>> Let's face it, we're not going to _ever_ checkpoint any kind of general
>> case process. Just TCP makes that fundamentally impossible in the general
>> case, and there are lots and lots of other cases too (just something as
>> totally _trivial_ as all the files in the filesystem that don't get rolled
>> back).
>
> In some instances such as ours, TCP is probably the easiest thing to
> migrate. In an rpc-based cluster application, TCP is nothing more than
> an RPC channel and applications already have to handle RPC channel
> failure and re-establishment.
>
> I agree that this is not the 'general case' as you mention above
> however. This is the bit that sorta bothers me with the way the
> implementation has been going so far on this list. The implementation
> that folks are building on top of Oren's patchset tries to be everything
> to everybody. For our purposes, we need to have the flexibility of
> choosing *how* we checkpoint. The line seems to be arbitrarily drawn at
> the kernel being responsible for checkpointing and restoring all
> resources associated with a task, and leaving userland with nothing more
> than transporting filesystem bits. This approach isn't flexible enough:
> Consider the case where we want to stub out most of the TCP file
> descriptors with ECONNRESETed sockets because we know that they are RPC
> sockets and can re-establish themselves, but we want to use some other
> mechanism for TCP sockets we don't know much about. The current
> monolithic approach has zero flexibility for doing anything like this,
> and I figure out how we could even fit anything like this in.

The flexibility exists, but wasn't spelled out, so here it is:

1) Similar to madvice(), I envision a cradvice() that could tell the c/r
something about specific resources, e.g.:
* cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's scratch
* cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection on restart
etc .. (nevermind the exact interface right now)

2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
or a restart complete successfully. At that time they can do their private
house-keeping if they know better.

3) If restoring some resource is significantly easier in user space (e.g. a
file-descriptor of some special device which user space knows how to
re-initialize), then the restarting task can prepare it ahead of time,
and, call:
* cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of trying
to restore it yourself.

Method #3 is what I used in Zap to implement distributed checkpoints, where
it is so much easier to recreate all network connections in user space then
putting that logic into the kernel.

Now, on the other hand, doing the c/r from userland is much less flexible
than in the kernel (e.g. epollfd, futex state and much more) and requires
exposing tremendous amount of in-kernel data to user space. And we all know
than exposing internals is always a one-way ticket :(

[...]

Oren.

2009-03-14 00:14:18

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
>
>
> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> > >
> > > Let's face it, we're not going to _ever_ checkpoint any kind of general
> > > case process. Just TCP makes that fundamentally impossible in the general
> > > case, and there are lots and lots of other cases too (just something as
> > > totally _trivial_ as all the files in the filesystem that don't get rolled
> > > back).
> >
> > What do you mean here? Unlinked files?
>
> Or modified files, or anything else. "External state" is a pretty damn
> wide net. It's not just TCP sequence numbers and another machine.

I think (I think) you're seriously underestimating what's doable with
kernel C/R and what's already done.

I was told (haven't seen it myself) that Oracle installations and
Counter Strike servers were moved between boxes just fine.

They were run in specially prepared environment of course, but still.

2009-03-14 00:54:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Oren Laadan <[email protected]> writes:

> Dave Hansen wrote:
>> On Fri, 2009-03-13 at 14:01 -0700, Linus Torvalds wrote:
>>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>>> Let's face it, we're not going to _ever_ checkpoint any kind of general
>>>>> case process. Just TCP makes that fundamentally impossible in the general
>>>>> case, and there are lots and lots of other cases too (just something as
>>>>> totally _trivial_ as all the files in the filesystem that don't get rolled
>>>>> back).
>>>> What do you mean here? Unlinked files?
>>> Or modified files, or anything else. "External state" is a pretty damn
>>> wide net. It's not just TCP sequence numbers and another machine.
>>
>> This is precisely the reason that we've focused so hard on containers,
>> and *didn't* just jump right into checkpoint/restart; we're trying
>> really hard to constrain the _truly_ external things that a process can
>> interact with.
>>
>> The approach so far has largely been to make things are external to a
>> process at least *internal* to a container. Network, pid, ipc, and uts
>> namespaces, for example. An ipc/sem.c semaphore may be external to a
>> process, so we'll just pick the whole namespace up and checkpoint it
>> along with the process.
>>
>> In the OpenVZ case, they've at least demonstrated that the filesystem
>> can be moved largely with rsync. Unlinked files need some in-kernel TLC
>> (or /proc mangling) but it isn't *that* bad.
>
> And in the Zap we have successfully used a log-based filesystem
> (specifically NILFS) to continuously snapshot the file-system atomically
> with taking a checkpoint, so it can easily branch off past checkpoints,
> including the file system.
>
> And unlinked files can be (inefficiently) handled by saving their full
> contents with the checkpoint image - it's not a big toll on many apps
> (if you exclude Wine and UML...). At least that's a start.

Oren we might want to do a proof of concept implementation like I did
with network namespaces. That is done in the community and goes far
enough to show we don't have horribly nasty code. The patches and
individual changes don't need to be quite perfect but close enough
that they can be considered for merging.

For the network namespace that seems to have made a big difference.

I'm afraid in our clean start we may have focused a little too much
on merging something simple and not gone far enough on showing that
things will work.

After I had that in the network namespace and we had a clear vision of
the direction. We started merging the individual patches and things
went well.

Eric

2009-03-14 08:13:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?


* Eric W. Biederman <[email protected]> wrote:

> >> In the OpenVZ case, they've at least demonstrated that the
> >> filesystem can be moved largely with rsync. Unlinked files
> >> need some in-kernel TLC (or /proc mangling) but it isn't
> >> *that* bad.
> >
> > And in the Zap we have successfully used a log-based
> > filesystem (specifically NILFS) to continuously snapshot the
> > file-system atomically with taking a checkpoint, so it can
> > easily branch off past checkpoints, including the file
> > system.
> >
> > And unlinked files can be (inefficiently) handled by saving
> > their full contents with the checkpoint image - it's not a
> > big toll on many apps (if you exclude Wine and UML...). At
> > least that's a start.
>
> Oren we might want to do a proof of concept implementation
> like I did with network namespaces. That is done in the
> community and goes far enough to show we don't have horribly
> nasty code. The patches and individual changes don't need to
> be quite perfect but close enough that they can be considered
> for merging.
>
> For the network namespace that seems to have made a big
> difference.
>
> I'm afraid in our clean start we may have focused a little too
> much on merging something simple and not gone far enough on
> showing that things will work.
>
> After I had that in the network namespace and we had a clear
> vision of the direction. We started merging the individual
> patches and things went well.

I'm curious: what is the actual end result other than good
looking code? In terms of tangible benefits to the everyday
Linux distro user. [This is not meant to be sarcastic, i'm
truly curious.]

Ingo

2009-03-14 08:27:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?


* Alexey Dobriyan <[email protected]> wrote:

> On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
> >
> >
> > On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> > > >
> > > > Let's face it, we're not going to _ever_ checkpoint any
> > > > kind of general case process. Just TCP makes that
> > > > fundamentally impossible in the general case, and there
> > > > are lots and lots of other cases too (just something as
> > > > totally _trivial_ as all the files in the filesystem
> > > > that don't get rolled back).
> > >
> > > What do you mean here? Unlinked files?
> >
> > Or modified files, or anything else. "External state" is a
> > pretty damn wide net. It's not just TCP sequence numbers and
> > another machine.
>
> I think (I think) you're seriously underestimating what's
> doable with kernel C/R and what's already done.
>
> I was told (haven't seen it myself) that Oracle installations
> and Counter Strike servers were moved between boxes just fine.
>
> They were run in specially prepared environment of course, but
> still.

That's the kind of stuff i'd like to see happen.

Right now the main 'enterprise' approach to do
migration/consolidation of server contexts is based on hardware
virtualization - but that pushes runtime overhead to the native
kernel and slows down the guest context as well - massively so.

Before we've blinked twice it will be a 'required' enterprise
feature and enterprise people will measure/benchmark Linux
server performance in guest context primarily and we'll have a
deep performance pit to dig ourselves out of.

We can ignore that trend as uninteresting (it is uninteresting
in a number of ways because it is partly driven by stupidity),
or we can do something about it while still advancing the
kernel.

With containers+checkpointing the code is a lot scarier (we
basically do system call virtualization), the environment
interactions are a lot wider and thus they are a lot more
difficult to handle - but it's all a lot faster as well, and
conceptually so. All the runtime overhead is pushed to the
checkpointing step - (with some minimal amount of data structure
isolation overhead).

I see three conceptual levels of virtualization:

- hardware based virtualization, for 'unaware OSs'

- system call based virtualization, for 'unaware software'

- no virtualization kernel help is needed _at all_ to
checkpoint 'aware' software. We have libraries to checkpoint
'aware' user-space just fine - and had them for a decade.

Ingo

2009-03-16 06:03:58

by Oren Laadan

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?



Ingo Molnar wrote:
> * Alexey Dobriyan <[email protected]> wrote:
>
>> On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
>>>
>>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>>> Let's face it, we're not going to _ever_ checkpoint any
>>>>> kind of general case process. Just TCP makes that
>>>>> fundamentally impossible in the general case, and there
>>>>> are lots and lots of other cases too (just something as
>>>>> totally _trivial_ as all the files in the filesystem
>>>>> that don't get rolled back).
>>>> What do you mean here? Unlinked files?
>>> Or modified files, or anything else. "External state" is a
>>> pretty damn wide net. It's not just TCP sequence numbers and
>>> another machine.
>> I think (I think) you're seriously underestimating what's
>> doable with kernel C/R and what's already done.
>>
>> I was told (haven't seen it myself) that Oracle installations
>> and Counter Strike servers were moved between boxes just fine.
>>
>> They were run in specially prepared environment of course, but
>> still.
>
> That's the kind of stuff i'd like to see happen.
>
> Right now the main 'enterprise' approach to do
> migration/consolidation of server contexts is based on hardware
> virtualization - but that pushes runtime overhead to the native
> kernel and slows down the guest context as well - massively so.
>
> Before we've blinked twice it will be a 'required' enterprise
> feature and enterprise people will measure/benchmark Linux
> server performance in guest context primarily and we'll have a
> deep performance pit to dig ourselves out of.
>
> We can ignore that trend as uninteresting (it is uninteresting
> in a number of ways because it is partly driven by stupidity),
> or we can do something about it while still advancing the
> kernel.
>
> With containers+checkpointing the code is a lot scarier (we
> basically do system call virtualization), the environment
> interactions are a lot wider and thus they are a lot more
> difficult to handle - but it's all a lot faster as well, and
> conceptually so. All the runtime overhead is pushed to the
> checkpointing step - (with some minimal amount of data structure
> isolation overhead).

It's worthwhile the make the distinction between virtualization and
checkpoint/restart (c/r). Virtualization is about decoupling of the
applications from the underlying operating system by providing a
private and and virtual namespace, that is - containers. Checkpoint/
restart is ability to save the state of a container so that it can
be restart later from that point.

The point is, that virtualization is *already* part of the kernel
through namespaces (pid, ipc, mounts, etc). This considerable body
of work was eventually merged and is mostly complete, covering most
of the environment interactions. The runtime overhead is negligible.

Seeing that namespaces are now part of the kernel, we now build on
the existing virtualization to allow checkpoint/restart. The code is
not at all scary: record the state on checkpoint, and restore it on
restart. There is no runtime overhead for checkpoint but the downtime
incurred on an application when it is frozen for the duration of the
checkpoint.

>
> I see three conceptual levels of virtualization:
>
> - hardware based virtualization, for 'unaware OSs'
>
> - system call based virtualization, for 'unaware software'
>
> - no virtualization kernel help is needed _at all_ to
> checkpoint 'aware' software. We have libraries to checkpoint
> 'aware' user-space just fine - and had them for a decade.

Checkpoint/restart is almost orthogonal to virtualization (c/r only
needs a way to request a specific resource identifier for resources
that it creates). Therefore, the effort required to allow c/r of
'aware' software is nearly the same as for 'unaware' software.

IMHO this is the natural next time: make the c/r useful and attractive
by making it transparent (support 'unaware' software), complete (cover
nearly all features) and efficient (with low application downtime).

And this is precisely what we aim for with the current patchset.

Oren.

2009-03-16 23:35:55

by Kevin Fox

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

On Sat, 2009-03-14 at 01:12 -0700, Ingo Molnar wrote:
>
> * Eric W. Biederman <[email protected]> wrote:
>
> > >> In the OpenVZ case, they've at least demonstrated that the
> > >> filesystem can be moved largely with rsync. Unlinked files
> > >> need some in-kernel TLC (or /proc mangling) but it isn't
> > >> *that* bad.
> > >
> > > And in the Zap we have successfully used a log-based
> > > filesystem (specifically NILFS) to continuously snapshot the
> > > file-system atomically with taking a checkpoint, so it can
> > > easily branch off past checkpoints, including the file
> > > system.
> > >
> > > And unlinked files can be (inefficiently) handled by saving
> > > their full contents with the checkpoint image - it's not a
> > > big toll on many apps (if you exclude Wine and UML...). At
> > > least that's a start.
> >
> > Oren we might want to do a proof of concept implementation
> > like I did with network namespaces. That is done in the
> > community and goes far enough to show we don't have horribly
> > nasty code. The patches and individual changes don't need to
> > be quite perfect but close enough that they can be considered
> > for merging.
> >
> > For the network namespace that seems to have made a big
> > difference.
> >
> > I'm afraid in our clean start we may have focused a little too
> > much on merging something simple and not gone far enough on
> > showing that things will work.
> >
> > After I had that in the network namespace and we had a clear
> > vision of the direction. We started merging the individual
> > patches and things went well.
>
> I'm curious: what is the actual end result other than good
> looking code? In terms of tangible benefits to the everyday
> Linux distro user. [This is not meant to be sarcastic, i'm
> truly curious.]

>From an ordinary user perspective, I hate loosing my desktop state every
time there is a power bump or a new kernel/video driver comes down from
the distro provider. Some of the stuff I loose:
*All my terminals
*many tabs and windows
*each in a different directory
*vi
*which files I was editing
*which function I was coding
*screen
*scrollback buffer's contents
*history for debugging code
*command line arguments
*State of running apps
*web browser
*Tabs, yes it saves urls on crash, but sometimes the page cant
come back up (say, because of a form)
*where the windows are on the desktop
*evolution
*what folder is selected
*which message within the folder is selected
*rhythmbox
*misc other apps

Being able to reboot and get back to exactly where I was before the
reboot would save me a lot of time restarting apps and getting my
desktop back to where it was before the reboot. I'd also be more
inclined to reboot to get security updates more frequently if I didn't
loose track of what I was doing in the session, making machines more
secure in the process.

Kevin

PS: Yes, I know both GNOME and KDE have tried to deal with some of this
with their session manager stuff, but it doesn't restore everything and
only supported by some apps. It would probably take more work to get all
apps working with the session management stuff then supporting kernel
C/R.

> Ingo
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers
>
>

2009-03-18 18:58:14

by Mike Waychison

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Oren Laadan wrote:
>
> Mike Waychison wrote:
>> Linus Torvalds wrote:
>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>
>>>> Ying Han [[email protected]] wrote:
>>>> | Hi Serge:
>>>> | I made a patch based on Oren's tree recently which implement a new
>>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>>> | and it works as expected.
>>>>
>>>> Yes, I think we had a version of clone() with pid a while ago.
>>> Are people _at_all_ thinking about security?
>>>
>>> Obviously not.
>>>
>>> There's no way we can do anything like this. Sure, it's trivial to do
>>> inside the kernel. But it also sounds like a _wonderful_ attack vector
>>> against badly written user-land software that sends signals and has small
>>> races.
>> I'm not really sure how this is different than a malicious app going off
>> and spawning thousands of threads in an attempt to hit a target pid from
>> a security pov. Sure, it makes it easier, but it's not like there is
>> anything in place to close the attack vector.
>>
>>> Quite frankly, from having followed the discussion(s) over the last few
>>> weeks about checkpoint/restart in various forms, my reaction to just about
>>> _all_ of this is that people pushing this are pretty damn borderline.
>>>
>>> I think you guys are working on all the wrong problems.
>>>
>>> Let's face it, we're not going to _ever_ checkpoint any kind of general
>>> case process. Just TCP makes that fundamentally impossible in the general
>>> case, and there are lots and lots of other cases too (just something as
>>> totally _trivial_ as all the files in the filesystem that don't get rolled
>>> back).
>> In some instances such as ours, TCP is probably the easiest thing to
>> migrate. In an rpc-based cluster application, TCP is nothing more than
>> an RPC channel and applications already have to handle RPC channel
>> failure and re-establishment.
>>
>> I agree that this is not the 'general case' as you mention above
>> however. This is the bit that sorta bothers me with the way the
>> implementation has been going so far on this list. The implementation
>> that folks are building on top of Oren's patchset tries to be everything
>> to everybody. For our purposes, we need to have the flexibility of
>> choosing *how* we checkpoint. The line seems to be arbitrarily drawn at
>> the kernel being responsible for checkpointing and restoring all
>> resources associated with a task, and leaving userland with nothing more
>> than transporting filesystem bits. This approach isn't flexible enough:
>> Consider the case where we want to stub out most of the TCP file
>> descriptors with ECONNRESETed sockets because we know that they are RPC
>> sockets and can re-establish themselves, but we want to use some other
>> mechanism for TCP sockets we don't know much about. The current
>> monolithic approach has zero flexibility for doing anything like this,
>> and I figure out how we could even fit anything like this in.
>
> The flexibility exists, but wasn't spelled out, so here it is:
>
> 1) Similar to madvice(), I envision a cradvice() that could tell the c/r
> something about specific resources, e.g.:
> * cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's scratch
> * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection on restart
> etc .. (nevermind the exact interface right now)
>
> 2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
> or a restart complete successfully. At that time they can do their private
> house-keeping if they know better.
>
> 3) If restoring some resource is significantly easier in user space (e.g. a
> file-descriptor of some special device which user space knows how to
> re-initialize), then the restarting task can prepare it ahead of time,
> and, call:
> * cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of trying
> to restore it yourself.

This would be called by the embryo process (mktree.c?) before calling
sys_restart?

>
> Method #3 is what I used in Zap to implement distributed checkpoints, where
> it is so much easier to recreate all network connections in user space then
> putting that logic into the kernel.
>
> Now, on the other hand, doing the c/r from userland is much less flexible
> than in the kernel (e.g. epollfd, futex state and much more) and requires
> exposing tremendous amount of in-kernel data to user space. And we all know
> than exposing internals is always a one-way ticket :(
>
> [...]
>
> Oren.
>
>

2009-03-18 19:07:56

by Oren Laadan

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?



Mike Waychison wrote:
> Oren Laadan wrote:
>>
>> Mike Waychison wrote:
>>> Linus Torvalds wrote:
>>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>>
>>>>> Ying Han [[email protected]] wrote:
>>>>> | Hi Serge:
>>>>> | I made a patch based on Oren's tree recently which implement a new
>>>>> | syscall clone_with_pid. I tested with checkpoint/restart process
>>>>> tree
>>>>> | and it works as expected.
>>>>>
>>>>> Yes, I think we had a version of clone() with pid a while ago.
>>>> Are people _at_all_ thinking about security?
>>>>
>>>> Obviously not.
>>>>
>>>> There's no way we can do anything like this. Sure, it's trivial to
>>>> do inside the kernel. But it also sounds like a _wonderful_ attack
>>>> vector against badly written user-land software that sends signals
>>>> and has small races.
>>> I'm not really sure how this is different than a malicious app going
>>> off and spawning thousands of threads in an attempt to hit a target
>>> pid from a security pov. Sure, it makes it easier, but it's not like
>>> there is anything in place to close the attack vector.
>>>
>>>> Quite frankly, from having followed the discussion(s) over the last
>>>> few weeks about checkpoint/restart in various forms, my reaction to
>>>> just about _all_ of this is that people pushing this are pretty damn
>>>> borderline.
>>>> I think you guys are working on all the wrong problems.
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of
>>>> general case process. Just TCP makes that fundamentally impossible
>>>> in the general case, and there are lots and lots of other cases too
>>>> (just something as totally _trivial_ as all the files in the
>>>> filesystem that don't get rolled back).
>>> In some instances such as ours, TCP is probably the easiest thing to
>>> migrate. In an rpc-based cluster application, TCP is nothing more
>>> than an RPC channel and applications already have to handle RPC
>>> channel failure and re-establishment.
>>>
>>> I agree that this is not the 'general case' as you mention above
>>> however. This is the bit that sorta bothers me with the way the
>>> implementation has been going so far on this list. The
>>> implementation that folks are building on top of Oren's patchset
>>> tries to be everything to everybody. For our purposes, we need to
>>> have the flexibility of choosing *how* we checkpoint. The line seems
>>> to be arbitrarily drawn at the kernel being responsible for
>>> checkpointing and restoring all resources associated with a task, and
>>> leaving userland with nothing more than transporting filesystem
>>> bits. This approach isn't flexible enough: Consider the case where
>>> we want to stub out most of the TCP file descriptors with
>>> ECONNRESETed sockets because we know that they are RPC sockets and
>>> can re-establish themselves, but we want to use some other mechanism
>>> for TCP sockets we don't know much about. The current monolithic
>>> approach has zero flexibility for doing anything like this, and I
>>> figure out how we could even fit anything like this in.
>>
>> The flexibility exists, but wasn't spelled out, so here it is:
>>
>> 1) Similar to madvice(), I envision a cradvice() that could tell the c/r
>> something about specific resources, e.g.:
>> * cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's
>> scratch
>> * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection
>> on restart
>> etc .. (nevermind the exact interface right now)
>>
>> 2) Tasks can ask to be notified (e.g. register a signal) when a
>> checkpoint
>> or a restart complete successfully. At that time they can do their
>> private
>> house-keeping if they know better.
>>
>> 3) If restoring some resource is significantly easier in user space
>> (e.g. a
>> file-descriptor of some special device which user space knows how to
>> re-initialize), then the restarting task can prepare it ahead of time,
>> and, call:
>> * cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of
>> trying
>> to restore it yourself.
>
> This would be called by the embryo process (mktree.c?) before calling
> sys_restart?

Yes.

>
>>
>> Method #3 is what I used in Zap to implement distributed checkpoints,
>> where
>> it is so much easier to recreate all network connections in user space
>> then
>> putting that logic into the kernel.
>>
>> Now, on the other hand, doing the c/r from userland is much less flexible
>> than in the kernel (e.g. epollfd, futex state and much more) and requires
>> exposing tremendous amount of in-kernel data to user space. And we all
>> know
>> than exposing internals is always a one-way ticket :(
>>
>> [...]
>>
>> Oren.
>>
>>
>
>

2009-03-19 21:19:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?

Ingo Molnar <[email protected]> writes:

> * Eric W. Biederman <[email protected]> wrote:
>
>> >> In the OpenVZ case, they've at least demonstrated that the
>> >> filesystem can be moved largely with rsync. Unlinked files
>> >> need some in-kernel TLC (or /proc mangling) but it isn't
>> >> *that* bad.
>> >
>> > And in the Zap we have successfully used a log-based
>> > filesystem (specifically NILFS) to continuously snapshot the
>> > file-system atomically with taking a checkpoint, so it can
>> > easily branch off past checkpoints, including the file
>> > system.
>> >
>> > And unlinked files can be (inefficiently) handled by saving
>> > their full contents with the checkpoint image - it's not a
>> > big toll on many apps (if you exclude Wine and UML...). At
>> > least that's a start.
>>
>> Oren we might want to do a proof of concept implementation
>> like I did with network namespaces. That is done in the
>> community and goes far enough to show we don't have horribly
>> nasty code. The patches and individual changes don't need to
>> be quite perfect but close enough that they can be considered
>> for merging.
>>
>> For the network namespace that seems to have made a big
>> difference.
>>
>> I'm afraid in our clean start we may have focused a little too
>> much on merging something simple and not gone far enough on
>> showing that things will work.
>>
>> After I had that in the network namespace and we had a clear
>> vision of the direction. We started merging the individual
>> patches and things went well.
>
> I'm curious: what is the actual end result other than good
> looking code? In terms of tangible benefits to the everyday
> Linux distro user. [This is not meant to be sarcastic, i'm
> truly curious.]

Of the network namespace? Sorry I'm not certain what you are asking.

Eric