2009-04-14 03:46:46

by Oren Laadan

Subject: Creating tasks on restart: userspace vs kernel


For checkpoint/restart (c/r) we need a method to (re)create the task
tree during restart. There are basically two approaches: in userspace
(zap approach) or in the kernel (openvz approach).

Once tasks have been created, both approaches are similar in that all
restarting tasks end up calling the equivalent of "do_restart()" in
the kernel to perform the gory details of restoring their state.

In terms of performance, both approaches are similar, and both can
optimize to avoid duplicating resources unnecessarily during the
clone (e.g. mm, etc) knowing that they will be reconstructed soon
after.

So the question is what's better - user-space or kernel ?

Too bad that Alexey chose to ignore what's been discussed in
linux-containers mailing list in his recent post. Here is my take on
cons/pros.

Task creation in the kernel
---------------------------
* how: the user program calls sys_restart() which, for each task to
restore, creates a kernel thread which is demoted to a regular
process manually.

* pro: a single task that calls sys_restart()
* pro: restarting tasks are in full control of kernel at all times

* con: arch-dependent, harder to port across architectures
* con: can only restart a full container

Task creation in user space
---------------------------
* how: the user program calls fork/clone to recreate a suitable
task tree in userspace, and each task calls sys_restart() to restore
its state; some kernel glue is necessary to synchronize restarting
tasks when in the kernel.

* pro: allows important flexibility during restart (see <1>)
* pro: code leverages existing well-understood syscalls (fork, clone)
* pro: allows restart of only a subtree (see <2>)

* con: requires a way to create tasks with a specific pid (see <3>)
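
To make the userspace approach above a bit more concrete, here is a
minimal sketch of recreating a task tree with plain fork(). It is only
an illustration under stated assumptions: struct task_desc, the "fail"
field, fake_restart() and restart_tree() are hypothetical names, and
fake_restart() merely stands in for the real sys_restart() call (which
would replace the task's state rather than return to this code). The
table is assumed to list parents before their children, with entry 0 as
the root.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical description of one task taken from a checkpoint image. */
struct task_desc {
	int parent;	/* index of the parent entry, -1 for the root */
	int fail;	/* nonzero: simulate a failing restart (demo only) */
};

/* Stand-in for sys_restart(); NOT the real syscall. */
static int fake_restart(const struct task_desc *t)
{
	return t->fail ? -1 : 0;
}

/*
 * Runs in the process playing task `self`: fork every direct child,
 * "restart" ourselves, then collect the children's results.
 * Returns the number of failures seen in this subtree.
 */
static int spawn_subtree(const struct task_desc *tasks, int n, int self)
{
	int failures = 0, nchildren = 0;

	/* Parents precede children in the table, so start at self + 1. */
	for (int i = self + 1; i < n; i++) {
		if (tasks[i].parent != self)
			continue;
		pid_t pid = fork();
		if (pid < 0) {
			failures++;
			continue;
		}
		if (pid == 0)
			_exit(spawn_subtree(tasks, n, i) ? 1 : 0);
		nchildren++;
	}

	if (fake_restart(&tasks[self]) != 0)
		failures++;

	while (nchildren-- > 0) {
		int status;
		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failures++;
	}
	return failures;
}

/* Supervisor: create the root task and report 0 (success) or -1. */
int restart_tree(const struct task_desc *tasks, int n)
{
	int status;
	pid_t pid;

	assert(tasks != NULL && n > 0);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(spawn_subtree(tasks, n, 0) ? 1 : 0);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The parent/child relationships fall out of the fork order, and the
supervisor that calls restart_tree() is the only process that has to
report success or failure to the user.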

<1> Flexibility:

In the spirit of madvise() that lets tasks advise the kernel because
they know better, there should be cradvise() for checkpoint/restart
purposes. During checkpoint it can tell the kernel "don't save this
piece of memory, it's scratch", or "ignore this file-descriptor" etc.
During restart, it can tell the kernel "use this file-descriptor"
or "use this network namespace" (instead of trying to restore).

Offering cradvise() capability during restart is especially important
in cases where the kernel (inevitably) won't know how to restore a
resource (e.g. think special devices), when the application wants to
override (e.g. think of a c/r aware server that would like to change
the port on which it is listening), or when it's that much simpler to
do it in userspace (e.g. think setting up network namespaces).

Another important example is distributed checkpoint, where the
restarting tasks could (re)create all their network connections in
user space, before invoking sys_restart() and tell the kernel, via
cradvise(), to use the newly created sockets.

The need for this sort of flexibility has been stressed multiple times
and by multiple stake-holders interested in checkpoint/restart.
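
As a rough illustration of the idea (not a proposed interface), the
advice could be modelled as a small table that checkpoint and restart
consult. Everything here is hypothetical: no cradvise() exists, and the
names cradvise_model(), cr_resolve_fd(), cr_should_save_fd() and the
CR_ADV_* constants are invented for this sketch, which runs entirely in
user space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical advice kinds; none of these exist in any kernel. */
enum cr_advice {
	CR_ADV_SKIP_FD,	/* checkpoint: do not save this descriptor */
	CR_ADV_USE_FD,	/* restart: use a caller-supplied descriptor */
};

#define CR_MAX_ADVICE 16

struct cr_advice_entry {
	enum cr_advice kind;
	int saved_fd;		/* fd number as recorded in the image */
	int replacement_fd;	/* only meaningful for CR_ADV_USE_FD */
};

struct cr_advice_table {
	struct cr_advice_entry entries[CR_MAX_ADVICE];
	size_t count;
};

/* Record one piece of advice; returns 0, or -1 if the table is full. */
int cradvise_model(struct cr_advice_table *t, enum cr_advice kind,
		   int saved_fd, int replacement_fd)
{
	assert(t != NULL);
	if (t->count >= CR_MAX_ADVICE)
		return -1;
	t->entries[t->count].kind = kind;
	t->entries[t->count].saved_fd = saved_fd;
	t->entries[t->count].replacement_fd = replacement_fd;
	t->count++;
	return 0;
}

/*
 * Restart side: return the descriptor the task asked us to use for
 * saved_fd, or -1 meaning "no advice, restore it from the image".
 */
int cr_resolve_fd(const struct cr_advice_table *t, int saved_fd)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->entries[i].kind == CR_ADV_USE_FD &&
		    t->entries[i].saved_fd == saved_fd)
			return t->entries[i].replacement_fd;
	return -1;
}

/* Checkpoint side: 0 if the task asked us to ignore this descriptor. */
int cr_should_save_fd(const struct cr_advice_table *t, int saved_fd)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->entries[i].kind == CR_ADV_SKIP_FD &&
		    t->entries[i].saved_fd == saved_fd)
			return 0;
	return 1;
}
```

In the distributed case, a restarting task would first reconnect its
socket in user space and then register it with something like
cradvise_model(&tbl, CR_ADV_USE_FD, saved_fd, new_fd) before restart.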

<2> Restarting a subtree:

The primary c/r effort is directed towards providing c/r functionality
for containers.

Wouldn't it be nice if, while doing so and at minimal added effort, we
also gain a method to checkpoint and restart an arbitrary subtree of
tasks, which isn't necessarily an entire container ?

Sure, it will be more constrained (e.g. the resulting pids in restart
won't match the original pids), and won't work for all applications.
But it will still be a useful tool for many use cases, like batch cpu
jobs, some servers, vnc sessions (if you want graphics) etc. Imagine
you run 'octave' for a week and must reboot now - 'octave' wouldn't
care if you checkpointed it and then restarted it with a different pid !

<3> Clone with pid:

To restart processes from userspace, there needs to be a way to
request a specific pid--in the current pid_ns--for the child process
(clearly, if it isn't in use).

Why is it a disadvantage ? To Linus, a syscall clone_with_pid()
"sounds like a _wonderful_ attack vector against badly written
user-land software...". Actually, getting a specific pid is possible
without this syscall. But the point is that it's undesirable to have
this functionality unrestricted.

So one option is to require root privileges. Another option is to
restrict such action to a pid_ns created by the same user. Even more
so, restrict it to containers that are being restarted.

---

Either way we go, it should be fairly easy to switch from one method
to the other, should we need to.

All in all, there isn't a strong reason in favor of the kernel method.

In contrast, it's at least as simple in userspace (reusing existing
syscalls). More importantly, we gain the flexibility of restarting
tasks in userspace at no cost (in terms of implementation or runtime
overhead).

Oren.


2009-04-14 10:00:31

by Ingo Molnar

Subject: Re: Creating tasks on restart: userspace vs kernel


* Oren Laadan wrote:

> <3> Clone with pid:
>
> To restart processes from userspace, there needs to be a way to
> request a specific pid--in the current pid_ns--for the child
> process (clearly, if it isn't in use).
>
> Why is it a disadvantage ? to Linus, a syscall clone_with_pid()
> "sounds like a _wonderful_ attack vector against badly written
> user-land software...". Actually, getting a specific pid is
> possible without this syscall. But the point is that it's
> undesirable to have this functionality unrestricted.

The point is that there's a class of a difference between a racy and
unreliable method of 'create tens of thousands of tasks to steal the
right PID you are interested in' and a built-in syscall that gives
this within a couple of microseconds.

Most signal races are timing dependent so the ability to do it
really quickly makes or breaks the practicality of many classes of
exploits.

> So one option is to require root privileges. Another option is to
> restrict such action in pid_ns created by the same user. Even more
> so, restrict to only containers that are being restarted.

Requiring root privileges seems to remove much of the appeal of
allowing this to be a more generic sub-container creation thing. If
regular unprivileged apps cannot use this to save/restore their own
local task hierarchy, the whole thing becomes rather pointless,
right?

Ingo

2009-04-14 14:57:20

by Oren Laadan

Subject: Re: Creating tasks on restart: userspace vs kernel



Ingo Molnar wrote:
> * Oren Laadan wrote:
>
>> <3> Clone with pid:
>>
>> To restart processes from userspace, there needs to be a way to
>> request a specific pid--in the current pid_ns--for the child
>> process (clearly, if it isn't in use).
>>
>> Why is it a disadvantage ? to Linus, a syscall clone_with_pid()
>> "sounds like a _wonderful_ attack vector against badly written
>> user-land software...". Actually, getting a specific pid is
>> possible without this syscall. But the point is that it's
>> undesirable to have this functionality unrestricted.
>
> The point is that there's a class of a difference between a racy and
> unreliable method of 'create tens of thousands of tasks to steal the
> right PID you are interested in' and a built-in syscall that gives
> this within a couple of microseconds.
>
> Most signal races are timing dependent so the ability to do it
> really quickly makes or breaks the practicality of many classes of
> exploits.

Exactly.

>
>> So one option is to require root privileges. Another option is to
>> restrict such action in pid_ns created by the same user. Even more
>> so, restrict to only containers that are being restarted.
>
> Requiring root privileges seems to remove much of the appeal of
> allowing this to be a more generic sub-container creation thing. If
> regular unprivileged apps cannot use this to save/restore their own
> local task hierarchy, the whole thing becomes rather pointless,
> right?

First, I suggest distinguishing between two cases: (1) c/r of a whole
container, and (2) c/r of a task subtree. (#2 is a nice byproduct of
this work, but with more limited scope/applicability).

#2 is easier: we don't use a new ipc_ns necessarily, so we don't need
to (and perhaps can't) restore old pids. So there is no question about
privileges. (This of course requires that the application be c/r-aware
or c/r-agnostic).

For #1, we need to create a new container to begin with. This already
requires CAP_SYS_ADMIN. Yes, for now we can use some setuid() to create
a new pid_ns and then do the restart.

We will eventually need CAP_SYS_ADMIN for other parts of the restart,
for instance to restore a listening socket on a privileged port, or to
restore tasks of multiple users, or to restore an open file accessible
by, say, root only (assume the original task opened the file and then
dropped its privileges).

So for c/r - eventually we'll need to trust something in the checkpoint
image, like you trust a kernel module. One way to do it is to have the
userland utility (particularly restart) setuid, and have it sign the
image during checkpoint and then verify the signature during restart.

Oren.

2009-04-14 16:16:37

by Serge E. Hallyn

Subject: Re: Creating tasks on restart: userspace vs kernel

Quoting Oren Laadan:
> For #1, we need to create a new container to begin with. This already
> requires CAP_SYS_ADMIN. Yes, for now we can use some setuid() to create
> a new pid_ns and then do the restart.

This is why I like tagging a pidns with a userid, and requiring that
current->euid==pidns->uid in order to be allowed to set pid in that
pidns.

We require cap_sys_admin while doing clone(CLONE_NEWPID). So if we do
that while uid=500, then drop cap_sys_admin, then we can proceed to
create new tasks with specified pids in that pidns.

-serge

2009-04-14 16:36:34

by Alexey Dobriyan

Subject: Re: Creating tasks on restart: userspace vs kernel

On Mon, Apr 13, 2009 at 11:43:30PM -0400, Oren Laadan wrote:
> For checkpoint/restart (c/r) we need a method to (re)create the tasks
> tree during restart. There are basically two approaches: in userspace
> (zap approach) or in the kernel (openvz approach).
>
> Once tasks have been created both approaches are similar in that all
> restarting tasks end up calling the equivalent of "do_restart()" in
> the kernel to perform the gory details of restoring its state.
>
> In terms of performance, both approaches are similar, and both can
> optimize to avoid duplicating resources unnecessarily during the
> clone (e.g. mm, etc) knowing that they will be reconstructed soon
> after.
>
> So the question is what's better - user-space or kernel ?
>
> Too bad that Alexey chose to ignore what's been discussed in
> linux-containers mailing list in his recent post. Here is my take on
> cons/pros.
>
> Task creation in the kernel
> ---------------------------
> * how: the user program calls sys_restart() which, for each task to
> restore, creates a kernel thread which is demoted to a regular
> process manually.
>
> * pro: a single task that calls sys_restart()
> * pro: restarting tasks are in full control of kernel at all times
>
> * con: arch-dependent, harder to port across architectures

Not a "con" at all.

For the usage purposes, kernel_thread() is arch-independent.
Filesystems create kernel threads and do not have a single bit of
arch-specific code.

> * con: can only restart a full container

This is by design.

Granularity of whole damn thing is one container both on checkpoint and
restart.

You want to chop pieces, fine, do surgery on _image_.

> Task creation in user space
> ---------------------------
> * how: the user programs calls fork/clone to recreate a suitable
> task tree in userspace, and each task calls sys_restart() to restore
> its state; some kernel glue is necessary to synchronize restarting
> tasks when in the kernel.

> * pro: allows important flexibility during restart (see <1>)
> * pro: code leverages existing well-understood syscalls (fork, clone)

kernel_thread() is effectively clone(2).

> * pro: allows restart of a only subtree (see <2>)
>
> * con: requires a way to creates tasks with specific pid (see <3>)
>
> <1> Flexibility:
>
> In the spirit of madvise() that lets tasks advise the kernel because
> they know better, there should be cradvise() for checkpoint/restart
> purposes. During checkpoint it can tell the kernel "don't save this
> piece of memory, it's scratch", or "ignore this file-descriptor" etc.
> During restart, it will can tell the kernel "use this file-descriptor"
> or "use this network namespace" (instead of trying to restore).
>
> Offering cradvise() capability during restart is especially important
> in cases where the kernel (inevitably) won't know how to restore a
> resource (e.g. think special devices), when the application wants to
> override (e.g. think of a c/r aware server that would like to change
> the port on which it is listening), or when it's that much simpler to
> do it in userspace (e.g. think setting up network namespaces).
>
> Another important example is distributed checkpoint, where the
> restarting tasks could (re)create all their network connections in
> user space, before invoking sys_restart() and tell the kernel, via
> cradvise(), to use the newly created sockets.
>
> The need for this sort of flexibility has been stressed multiple times
> and by multiple stake-holders interested in checkpoint/restart.
>
> <2> Restarting a subtree:
>
> The primary c/r effort is directed towards providing c/r functionality
> for containers.
>
> Wouldn't it be nice if, while doing so and at minimal added effort, we
> also gain a method to checkpoint and restart an arbitrary subtree of
> tasks, which isn't necessarily an entire container ?

Do this in userspace.

> Sure, it will be more constrained (e.g. resulting pid in restart won't
> match the original pids), and won't work for all applications.

Given a correctly written image chopper, all pids will be fine and
correctness will be bounded by how well the user understands what can
and can't be chopped.

Besides, if such a chopper can only chop task_struct's, you'll get a
correct image.

In the end, correctness of chopping will come down to how well the user
understands that two task_struct's are independent of each other.

> But it will still be a useful tool for many use cases, like batch cpu jobs,
> some servers, vnc sessions (if you want graphics) etc. Imagine you run
> 'octave' for a week and must reboot now - 'octave' wouldn't care if
> you checkpointed it and then restart with a different pid !
>
> <3> Clone with pid:
>
> To restart processes from userspace, there needs to be a way to
> request a specific pid--in the current pid_ns--for the child process
> (clearly, if it isn't in use).
>
> Why is it a disadvantage ? to Linus, a syscall clone_with_pid()
> "sounds like a _wonderful_ attack vector against badly written
> user-land software...". Actually, getting a specific pid is possible
> without this syscall. But the point is that it's undesirable to have
> this functionality unrestricted.
>
> So one option is to require root privileges. Another option is to
> restrict such action in pid_ns created by the same user. Even more so,
> restrict to only containers that are being restarted.

You want to do a small part in userspace and consequently end up with
hacks, both userspace-visible and in-kernel.

Pids aren't special, they are struct pid, dynamically allocated and
refcounted just like any other structures.

They _become_ special for your intended method of restart.

You also have flags in nsproxy image (or where?) like "do clone with
CLONE_NEWUTS".

This is unneeded!

The nsproxy (or task_struct) image has a reference (objref/position) to
the uts_ns image.

On restart, one looks up the object by reference or restores it if
needed, takes a refcount and glues the two together. Just like with any
other two structures.

No "what to do, what to do" logic.
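
The lookup-or-restore step described above could look roughly like the
following user-space sketch. The names struct cr_obj, struct
cr_objtable, cr_obj_get() and cr_objtable_free() are hypothetical, and
the calloc() call merely stands in for reading the object from the
image; none of this is taken from either patchset.

```c
#include <assert.h>
#include <stdlib.h>

#define CR_MAX_OBJS 32

/* Hypothetical restored object, e.g. a uts_ns. */
struct cr_obj {
	int refcount;
};

/* Maps an objref from the image to the already-restored object. */
struct cr_objtable {
	struct cr_obj *slots[CR_MAX_OBJS];
};

/*
 * Look the object up by reference; restore it on first use; in both
 * cases take a reference. Returns NULL on a bad objref or on
 * allocation failure.
 */
struct cr_obj *cr_obj_get(struct cr_objtable *t, int objref)
{
	assert(t != NULL);
	if (objref < 0 || objref >= CR_MAX_OBJS)
		return NULL;
	if (!t->slots[objref]) {
		/* Stand-in for restoring the object from the image. */
		struct cr_obj *o = calloc(1, sizeof(*o));
		if (!o)
			return NULL;
		t->slots[objref] = o;
	}
	t->slots[objref]->refcount++;
	return t->slots[objref];
}

/* Release everything the table restored. */
void cr_objtable_free(struct cr_objtable *t)
{
	for (int i = 0; i < CR_MAX_OBJS; i++) {
		free(t->slots[i]);
		t->slots[i] = NULL;
	}
}
```

Two tasks whose images carry the same objref end up sharing one object
with a refcount of 2, without any per-object flag in the image saying
whether to create it.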

> Either way we go, it should be fairly easy to switch from one method
> to the other, should we need to.
>
> All in all, there isn't a strong reason in favor of kernel method.
>
> In contrast, it's at least as simple in userspace (reusing existing
> syscalls). More importantly, the flexibility that we gain with restart
> of tasks in userspace, no cost incurred (in terms of implementation or
> runtime overhead).

Regarding who should orchestrate restart(2):

A special process that calls restart(2) should do it. It isn't related
to the restarted processes at all. It isn't, for example, the init of
the future container.

Reasons:
1) Somebody should write registers before the final jump to userspace.
The task itself can't generally do it: struct pt_regs is in the same
place as the kernel stack.

cr_load_cpu_regs() does exactly this: current writes to its own
pt_regs. Oren, why don't you see crashes?

I first tried to do it and was greeted with horrible crashes because
of, e.g., current becoming NULL under current. That's why
cr_arch_restore_task_struct() is not done in current context.

2) Somebody should restore who is parent of whom, who ptraces whom,
resolve all possible loops and so on. Intuition tells me that it
should be a context which is not involved in restart(2) other than
doing this post-restart-before-thaw part.

Consequently, it should not be the init of the future container. It's
just another task after all, from the POV of reparenting code.

2009-04-14 16:46:24

by Alexey Dobriyan

Subject: Re: Creating tasks on restart: userspace vs kernel

> 1) somebody should write registers before final jump to userspace.
> Task itself can't generally do it: struct pt_regs is in the same place
> as kernel stack.
>
> cr_load_cpu_regs() does exactly this: as current writes to it's own
> pt_regs. Oren, why don't you see crashes?
>
> I first tried to do it and was greeted with horrible crashes because
> e.g current becoming NULL under current. That's why
> cr_arch_restore_task_struct() is not done in current context.

Hmm, this must be an artefact of the kernel_thread() approach.

2009-04-14 18:43:38

by Oren Laadan

Subject: Re: Creating tasks on restart: userspace vs kernel



Alexey Dobriyan wrote:
> On Mon, Apr 13, 2009 at 11:43:30PM -0400, Oren Laadan wrote:
>> For checkpoint/restart (c/r) we need a method to (re)create the tasks
>> tree during restart. There are basically two approaches: in userspace
>> (zap approach) or in the kernel (openvz approach).
>>
>> Once tasks have been created both approaches are similar in that all
>> restarting tasks end up calling the equivalent of "do_restart()" in
>> the kernel to perform the gory details of restoring its state.
>>
>> In terms of performance, both approaches are similar, and both can
>> optimize to avoid duplicating resources unnecessarily during the
>> clone (e.g. mm, etc) knowing that they will be reconstructed soon
>> after.
>>
>> So the question is what's better - user-space or kernel ?
>>
>> Too bad that Alexey chose to ignore what's been discussed in
>> linux-containers mailing list in his recent post. Here is my take on
>> cons/pros.
>>
>> Task creation in the kernel
>> ---------------------------
>> * how: the user program calls sys_restart() which, for each task to
>> restore, creates a kernel thread which is demoted to a regular
>> process manually.
>>
>> * pro: a single task that calls sys_restart()
>> * pro: restarting tasks are in full control of kernel at all times
>>
>> * con: arch-dependent, harder to port across architectures
>
> Not a "con" at all.
>
> For the usage purposes, kernel_thread() is arch-independent.
> Filesystems create kernel threads and not have a single bit of
> arch-specific code.

My bad, I was relying on the older patchset submitted by Andrey.

>
>> * con: can only restart a full container
>
> This is by design.
>
> Granularity of whole damn thing is one container both on checkpoint and
> restart.
>
> You want to chop pieces, fine, do surgery on _image_.

I challenged that design decision already :)

Again, so to checkpoint one task in the topmost pid-ns you need to
checkpoint (if at all possible) the entire system ?!

>
>> Task creation in user space
>> ---------------------------
>> * how: the user programs calls fork/clone to recreate a suitable
>> task tree in userspace, and each task calls sys_restart() to restore
>> its state; some kernel glue is necessary to synchronize restarting
>> tasks when in the kernel.
>
>> * pro: allows important flexibility during restart (see <1>)
>> * pro: code leverages existing well-understood syscalls (fork, clone)
>
> kernel_thread() is effectively clone(2).

By "leverage" I mean that no hand-crafting is needed later, whereas if
you use a kernel thread you need to convert it to a process, reparent
it, etc. The more we rely on existing code, the faster the path into
the mainline kernel, the less maintenance in the future, and the less
likely it is to break due to other kernel changes.

>
>> * pro: allows restart of a only subtree (see <2>)
>>
>> * con: requires a way to creates tasks with specific pid (see <3>)
>>
>> <1> Flexibility:
>>
>> In the spirit of madvise() that lets tasks advise the kernel because
>> they know better, there should be cradvise() for checkpoint/restart
>> purposes. During checkpoint it can tell the kernel "don't save this
>> piece of memory, it's scratch", or "ignore this file-descriptor" etc.
>> During restart, it will can tell the kernel "use this file-descriptor"
>> or "use this network namespace" (instead of trying to restore).
>>
>> Offering cradvise() capability during restart is especially important
>> in cases where the kernel (inevitably) won't know how to restore a
>> resource (e.g. think special devices), when the application wants to
>> override (e.g. think of a c/r aware server that would like to change
>> the port on which it is listening), or when it's that much simpler to
>> do it in userspace (e.g. think setting up network namespaces).
>>
>> Another important example is distributed checkpoint, where the
>> restarting tasks could (re)create all their network connections in
>> user space, before invoking sys_restart() and tell the kernel, via
>> cradvise(), to use the newly created sockets.
>>
>> The need for this sort of flexibility has been stressed multiple times
>> and by multiple stake-holders interested in checkpoint/restart.
>>
>> <2> Restarting a subtree:
>>
>> The primary c/r effort is directed towards providing c/r functionality
>> for containers.
>>
>> Wouldn't it be nice if, while doing so and at minimal added effort, we
>> also gain a method to checkpoint and restart an arbitrary subtree of
>> tasks, which isn't necessarily an entire container ?
>
> Do this in userspace.
>
>> Sure, it will be more constrained (e.g. resulting pid in restart won't
>> match the original pids), and won't work for all applications.
>
> Given correctly written image chopper, all pids will be fine and
> correctness will be bounded by how good user understands what can be
> chopped and can't.
>
> Besides, if such chopper can only chop task_struct's, you'll get correct
> image.

See above.

>
> In the end correctness of chopping will be equal to how good user
> understands that two task_struct's are independent of each other.
>
>> But it will still be a useful tool for many use cases, like batch cpu jobs,
>> some servers, vnc sessions (if you want graphics) etc. Imagine you run
>> 'octave' for a week and must reboot now - 'octave' wouldn't care if
>> you checkpointed it and then restart with a different pid !
>>
>> <3> Clone with pid:
>>
>> To restart processes from userspace, there needs to be a way to
>> request a specific pid--in the current pid_ns--for the child process
>> (clearly, if it isn't in use).
>>
>> Why is it a disadvantage ? to Linus, a syscall clone_with_pid()
>> "sounds like a _wonderful_ attack vector against badly written
>> user-land software...". Actually, getting a specific pid is possible
>> without this syscall. But the point is that it's undesirable to have
>> this functionality unrestricted.
>>
>> So one option is to require root privileges. Another option is to
>> restrict such action in pid_ns created by the same user. Even more so,
>> restrict to only containers that are being restarted.
>
> You want to do small part in userspace and consequently end up with hacks
> both userspace-visible and in-kernel.

I want to extend existing kernel interface to leverage fork/clone
from user space, AND to allow the flexibility mentioned above (which
you conveniently ignored).

All hacks are in-kernel, aren't they ?

As for asking for a specific pid from user space, it can be done by:
* a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN)
* a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh)
* setting a special /proc/PID/next_id file which is consulted by fork

and in all cases, limit this so it is only allowed in a restarting
container, under the proper security model (again, e.g., Serge's
suggestion).

>
> Pids aren't special, they are struct pid, dynamically allocated and
> refcounted just like any other structtures.
>
> They _become_ special for you intended method of restart.

They are special. And I allow them not to be restored, as well, if
the use case so wishes.

>
> You also have flags in nsproxy image (or where?) like "do clone with
> CLONE_NEWUTS".

Nope. Read the code.

>
> This is unneeded!
>
> nsproxy (or task_struct) image have reference (objref/position) to uts_ns image.
>
> On restart, one lookups object by reference or restore it if needed,
> takes refcount and glue. Just like with every other two structures.

That's exactly how it's done.

>
> No "what to do, what to do" logic.
>
>> Either way we go, it should be fairly easy to switch from one method
>> to the other, should we need to.
>>
>> All in all, there isn't a strong reason in favor of kernel method.
>>
>> In contrast, it's at least as simple in userspace (reusing existing
>> syscalls). More importantly, the flexibility that we gain with restart
>> of tasks in userspace, no cost incurred (in terms of implementation or
>> runtime overhead).
>
> Regarding of who should orchestrate restart(2).
>
> Special process who calls restart(2) should do it. It doesn't relate to
> restarted process at all. It isn't, for example, init of future container.
>

Could also be the (new) container init. The parent of that process will
figure out the status of the operation (success/fail) and report it.
If any of the actual restarting tasks crashes/segfaults - very well,
the parent will hide that from the user and report 'failure'.

> Reasons:
> 1) somebody should write registers before final jump to userspace.
> Task itself can't generally do it: struct pt_regs is in the same place
> as kernel stack.
>
> cr_load_cpu_regs() does exactly this: as current writes to it's own
> pt_regs. Oren, why don't you see crashes?

LOL :)

Maybe because it works ?

>
> I first tried to do it and was greeted with horrible crashes because
> e.g current becoming NULL under current. That's why
> cr_arch_restore_task_struct() is not done in current context.
>
> 2) Somebody should restore who is parent of whom, who ptraces whom,
> resolved all possible loops and so on. Intuition tells me that it
> should be context which is not involved into restart(2) other than
> doing this post-restart-before-thaw part.

Who is the parent of whom - determined by the fork/clone order.

No loops, because when restart actually starts (that is, in the
kernel), all tasks have already been created, so they are readily
available.

ptrace property - (will be) restored as part of the regular task
restore for each task, in order. Within each task, it will be one
of the last things the task does to avoid generating spurious ptrace
events while restarting.

>
> Consequently, it should not be init of futire container. It's just
> another task after all, from the POV of reparenting code.
>

Not at all necessary, as explained above.

There is, however, a need for a parent task that will monitor the
operation and report success/failure to user.

Oren.

2009-04-14 19:59:09

by Alexey Dobriyan

Subject: Re: Creating tasks on restart: userspace vs kernel

> > In the end correctness of chopping will be equal to how good user
> > understands that two task_struct's are independent of each other.
> >
> >> But it will still be a useful tool for many use cases, like batch cpu jobs,
> >> some servers, vnc sessions (if you want graphics) etc. Imagine you run
> >> 'octave' for a week and must reboot now - 'octave' wouldn't care if
> >> you checkpointed it and then restart with a different pid !
> >>
> >> <3> Clone with pid:
> >>
> >> To restart processes from userspace, there needs to be a way to
> >> request a specific pid--in the current pid_ns--for the child process
> >> (clearly, if it isn't in use).
> >>
> >> Why is it a disadvantage ? to Linus, a syscall clone_with_pid()
> >> "sounds like a _wonderful_ attack vector against badly written
> >> user-land software...". Actually, getting a specific pid is possible
> >> without this syscall. But the point is that it's undesirable to have
> >> this functionality unrestricted.
> >>
> >> So one option is to require root privileges. Another option is to
> >> restrict such action in pid_ns created by the same user. Even more so,
> >> restrict to only containers that are being restarted.
> >
> > You want to do small part in userspace and consequently end up with hacks
> > both userspace-visible and in-kernel.
>
> I want to extend existing kernel interface to leverage fork/clone
> from user space, AND to allow the flexibility mentioned above (which
> you conveniently ignored).
>
> All hacks are in-kernel, aren't they ?

mktree.c can be viewed as a hack, why not?

The whole existence of these requirements. You want a new syscall or
SET_NEXT_PID or a /proc file or something.

> As for asking for a specific pid from user space, it can be done by:
> * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN)
> * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh)
> * setting a special /proc/PID/next_id file which is consulted by fork

/proc/*/next_id was discussed and hopefully died, but no.

> and in all cases, limit this so it can only allowed in a restarting
> container, under the proper security model (again, e.g., Serge's
> suggestion).
>
> >
> > Pids aren't special, they are struct pid, dynamically allocated and
> > refcounted just like any other structtures.
> >
> > They _become_ special for you intended method of restart.
>
> They are special. And I allow them not to be restored, as well, if
> the use case so wishes.

The use case is to restore as much as possible, to a state as close to
the original as possible. Not going with fork_with_pid() in any form
helps the kernel ensure correctness of restore and helps avoid surprise
failure modes from the user's POV.

> > You also have flags in nsproxy image (or where?) like "do clone with
> > CLONE_NEWUTS".
>
> Nope. Read the code.

Which code?

static int cr_write_namespaces(struct cr_ctx *ctx, struct task_struct *t)
{
        ...

        new_uts = cr_obj_add_ptr(ctx, nsproxy->uts_ns,
                                 &hh->uts_ref, CR_OBJ_UTSNS, 0);
        if (new_uts < 0) {
                ret = new_uts;
                goto out;
        }

        hh->flags = 0;
        if (new_uts)
===>            hh->flags |= CLONE_NEWUTS;

        ret = cr_write_obj(ctx, &h, hh);
        ...

> > This is unneeded!
> >
> > nsproxy (or task_struct) image have reference (objref/position) to uts_ns image.
> >
> > On restart, one lookups object by reference or restore it if needed,
> > takes refcount and glue. Just like with every other two structures.
>
> That's exactly how it's done.

Not for uts_ns and future namespaces.

        ret = cr_restore_utsns(ctx, hh->uts_ref, hh->flags);
                                                 ^^^^^^^^^
                                                 comes from disk

> > No "what to do, what to do" logic.

2009-04-14 20:14:01

by Oren Laadan

Subject: Re: Creating tasks on restart: userspace vs kernel



Alexey Dobriyan wrote:
>>> In the end correctness of chopping will be equal to how good user
>>> understands that two task_struct's are independent of each other.
>>>
>>>> But it will still be a useful tool for many use cases, like batch cpu jobs,
>>>> some servers, vnc sessions (if you want graphics) etc. Imagine you run
>>>> 'octave' for a week and must reboot now - 'octave' wouldn't care if
>>>> you checkpointed it and then restart with a different pid !
>>>>
>>>> <3> Clone with pid:
>>>>
>>>> To restart processes from userspace, there needs to be a way to
>>>> request a specific pid--in the current pid_ns--for the child process
>>>> (clearly, if it isn't in use).
>>>>
>>>> Why is it a disadvantage ? to Linus, a syscall clone_with_pid()
>>>> "sounds like a _wonderful_ attack vector against badly written
>>>> user-land software...". Actually, getting a specific pid is possible
>>>> without this syscall. But the point is that it's undesirable to have
>>>> this functionality unrestricted.
>>>>
>>>> So one option is to require root privileges. Another option is to
>>>> restrict such action in pid_ns created by the same user. Even more so,
>>>> restrict to only containers that are being restarted.
>>> You want to do small part in userspace and consequently end up with hacks
>>> both userspace-visible and in-kernel.
>> I want to extend existing kernel interface to leverage fork/clone
>> from user space, AND to allow the flexibility mentioned above (which
>> you conveniently ignored).
>>
>> All hacks are in-kernel, aren't they ?
>
> mktree.c can be viewed as a hack, why not?

Lol .. I meant "all kernel hacks are in-kernel" :)

>
> The whole existence of these requirements. You want a new syscall or
> SET_NEXT_PID or a /proc file or something.

Or embed it into a restart(2) call with a special argument.

>
>> As for asking for a specific pid from user space, it can be done by:
>> * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN)
>> * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh)
>> * setting a special /proc/PID/next_id file which is consulted by fork
>
> /proc/*/next_id was disscussed and hopefully died, but no.
>
>> and in all cases, limit this so it can only allowed in a restarting
>> container, under the proper security model (again, e.g., Serge's
>> suggestion).
>>
>>> Pids aren't special, they are struct pid, dynamically allocated and
>>> refcounted just like any other structtures.
>>>
>>> They _become_ special for you intended method of restart.
>> They are special. And I allow them not to be restored, as well, if
>> the use case so wishes.
>
> The use case is to restore as much as possible to the same state as
> equal as possible. Not going with fork_with_pid() in any form helps
> kernel to ensure correctness of restore and helps to avoid surprise
> failure modes from user POV.
>
>>> You also have flags in nsproxy image (or where?) like "do clone with
>>> CLONE_NEWUTS".
>> Nope. Read the code.
>
> Which code?
>
> static int cr_write_namespaces(struct cr_ctx *ctx, struct task_struct *t)
> {
> ...
>
> new_uts = cr_obj_add_ptr(ctx, nsproxy->uts_ns,
> &hh->uts_ref, CR_OBJ_UTSNS, 0);
> if (new_uts < 0) {
> ret = new_uts;
> goto out;
> }
>
> hh->flags = 0;
> if (new_uts)
> ===> hh->flags |= CLONE_NEWUTS;
>
> ret = cr_write_obj(ctx, &h, hh);
> ...
>
>>> This is unneeded!
>>>
>>> nsproxy (or task_struct) image have reference (objref/position) to uts_ns image.
>>>
>>> On restart, one lookups object by reference or restore it if needed,
>>> takes refcount and glue. Just like with every other two structures.
>> That's exactly how it's done.
>
> Not for uts_ns and future namespaces.
>
> ret = cr_restore_utsns(ctx, hh->uts_ref, hh->flags);
> ^^^^^^^^^
> comes from disk

Where else would it come from ? That's part of the state saved during
checkpoint.

That's for nested UTS namespaces, where a task in the container called
unshare().

Oren.

2009-04-14 21:01:48

by Alexey Dobriyan

Subject: Re: Creating tasks on restart: userspace vs kernel

On Tue, Apr 14, 2009 at 04:10:53PM -0400, Oren Laadan wrote:
>
>
> Alexey Dobriyan wrote:
> >>> In the end correctness of chopping will be equal to how good user
> >>> understands that two task_struct's are independent of each other.
> >>>
> >>>> But it will still be a useful tool for many use cases, like batch cpu jobs,
> >>>> some servers, vnc sessions (if you want graphics) etc. Imagine you run
> >>>> 'octave' for a week and must reboot now - 'octave' wouldn't care if
> >>>> you checkpointed it and then restart with a different pid !
> >>>>
> >>>> <3> Clone with pid:
> >>>>
> >>>> To restart processes from userspace, there needs to be a way to
> >>>> request a specific pid--in the current pid_ns--for the child process
> >>>> (clearly, if it isn't in use).
> >>>>
> >>>> Why is it a disadvantage ? to Linus, a syscall clone_with_pid()
> >>>> "sounds like a _wonderful_ attack vector against badly written
> >>>> user-land software...". Actually, getting a specific pid is possible
> >>>> without this syscall. But the point is that it's undesirable to have
> >>>> this functionality unrestricted.
> >>>>
> >>>> So one option is to require root privileges. Another option is to
> >>>> restrict such action in pid_ns created by the same user. Even more so,
> >>>> restrict to only containers that are being restarted.
> >>> You want to do small part in userspace and consequently end up with hacks
> >>> both userspace-visible and in-kernel.
> >> I want to extend existing kernel interface to leverage fork/clone
> >> from user space, AND to allow the flexibility mentioned above (which
> >> you conveniently ignored).
> >>
> >> All hacks are in-kernel, aren't they ?
> >
> > mktree.c can be vieved as hack, why not?
>
> Lol .. I meant "all kernel hacks are in-kernel" :)
>
> >
> > The whole existence of these requirements. You want new syscall or SET_NEX_PID
> > or /proc file or something.
>
> Or embed it into a restart(2) call with special argument.
>
> >
> >> As for asking for a specific pid from user space, it can be done by:
> >> * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN)
> >> * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh)
> >> * setting a special /proc/PID/next_id file which is consulted by fork
> >
> > /proc/*/next_id was disscussed and hopefully died, but no.
> >
> >> and in all cases, limit this so it can only allowed in a restarting
> >> container, under the proper security model (again, e.g., Serge's
> >> suggestion).
> >>
> >>> Pids aren't special, they are struct pid, dynamically allocated and
> >>> refcounted just like any other structtures.
> >>>
> >>> They _become_ special for you intended method of restart.
> >> They are special. And I allow them not to be restored, as well, if
> >> the use case so wishes.
> >
> > The use case is to restore as much as possible to the same state as
> > equal as possible. Not going with fork_with_pid() in any form helps
> > kernel to ensure correctness of restore and helps to avoid surprise
> > failure modes from user POV.
> >
> >>> You also have flags in nsproxy image (or where?) like "do clone with
> >>> CLONE_NEWUTS".
> >> Nope. Read the code.
> >
> > Which code?
> >
> > static int cr_write_namespaces(struct cr_ctx *ctx, struct task_struct *t)
> > {
> > ...
> >
> > new_uts = cr_obj_add_ptr(ctx, nsproxy->uts_ns,
> > &hh->uts_ref, CR_OBJ_UTSNS, 0);
> > if (new_uts < 0) {
> > ret = new_uts;
> > goto out;
> > }
> >
> > hh->flags = 0;
> > if (new_uts)
> > ===> hh->flags |= CLONE_NEWUTS;
> >
> > ret = cr_write_obj(ctx, &h, hh);
> > ...
> >
> >>> This is unneeded!
> >>>
> >>> nsproxy (or task_struct) image have reference (objref/position) to uts_ns image.
> >>>
> >>> On restart, one lookups object by reference or restore it if needed,
> >>> takes refcount and glue. Just like with every other two structures.
> >> That's exactly how it's done.
> >
> > Not for uts_ns and future namespaces.
> >
> > ret = cr_restore_utsns(ctx, hh->uts_ref, hh->flags);
> > ^^^^^^^^^
> > comes from disk
>
> Where else would it come from ? that's part of the state saved during
> checkpoint.

This is a bogus part of what is saved during checkpoint.

To restore nsproxy you only need references to uts_ns, ipc_ns, mnt_ns,
pid_ns and net_ns, no flags.

You can try to be smart and, consequently, end up with checks for the
cases where
a) flags say to unshare uts_ns, but the reference is the same, and
b) flags don't say to unshare, but the reference is different.

This is unneeded code that comes from restoring nsproxy the wrong way.
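The reference-only restore argued for here can be sketched in userspace
C. This is only an illustration under stated assumptions: struct ns_obj,
struct restore_ctx and lookup_or_restore() are hypothetical names, not
anything from the patchset. The point it shows is that the reference
alone decides whether an object is shared (same ref, same object) or new
(unseen ref, fresh object), so no separate flag can disagree with it.

```c
#include <stdlib.h>

#define MAX_OBJS 16

/* Hypothetical stand-in for a restored namespace object. */
struct ns_obj {
	int refcount;
};

/*
 * Hypothetical objref table: slot i holds the object restored for
 * image reference i, or NULL if that reference has not been seen yet.
 */
struct restore_ctx {
	struct ns_obj *objs[MAX_OBJS];
};

/*
 * Return the object for 'ref', creating it on first use, and take one
 * reference for the caller. Returns NULL on a bad ref or on OOM.
 */
static struct ns_obj *lookup_or_restore(struct restore_ctx *ctx, int ref)
{
	struct ns_obj *obj;

	if (ref < 0 || ref >= MAX_OBJS)
		return NULL;

	obj = ctx->objs[ref];
	if (!obj) {
		/* First time this reference is seen: "restore" the object. */
		obj = calloc(1, sizeof(*obj));
		if (!obj)
			return NULL;
		ctx->objs[ref] = obj;
	}
	obj->refcount++;
	return obj;
}
```

Two tasks whose images carry the same uts reference end up pointing at
one object with a refcount of two; a task with a different reference
gets its own object.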

> That's for nested UTS namespaces,

Just to clear up the terminology: UTS namespaces aren't nested, only
PID namespaces are.

> where a task in container called unshare().

2009-04-15 19:56:29

by Alexey Dobriyan

Subject: C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel)

> Again, so to checkpoint one task in the topmost pid-ns you need to
> checkpoint (if at all possible) the entire system ?!

One more argument for not allowing "leaks" and checkpointing the whole
container, no ifs, buts and woulditbenices.

Just to clarify, C/R with a "leak" is, for example, when a process has a
separate pidns but shares, say, its netns with another process not
involved in the checkpoint.

If you allow this, you lose one important property of the checkpoint
part, namely, that almost everything is frozen. Losing this property
means suddenly much more stuff is alive during the dump and you have to
account for more stuff when checkpointing. You are effectively
checkpointing live data structures and there is no guarantee you'll get
it right.

Example 1: utsns is shared with the rest of the world.

utsns content is modifiable only by tasks (current->nsproxy->uts_ns).
Consequently, someone can modify utsns content while you're dumping it
if you allow "leaks".

Did you take precautions? Where?

static int cr_write_utsns(struct cr_ctx *ctx, struct uts_namespace *uts_ns)
{
	struct cr_hdr h;
	struct cr_hdr_utsns *hh;
	int domainname_len;
	int nodename_len;
	int ret;

	h.type = CR_HDR_UTSNS;
	h.len = sizeof(*hh);

	hh = cr_hbuf_get(ctx, sizeof(*hh));
	if (!hh)
		return -ENOMEM;

	nodename_len = strlen(uts_ns->name.nodename) + 1;
	domainname_len = strlen(uts_ns->name.domainname) + 1;

	hh->nodename_len = nodename_len;
	hh->domainname_len = domainname_len;

	ret = cr_write_obj(ctx, &h, hh);
	cr_hbuf_put(ctx, sizeof(*hh));
	if (ret < 0)
		return ret;

	ret = cr_write_string(ctx, uts_ns->name.nodename, nodename_len);
	if (ret < 0)
		return ret;

	ret = cr_write_string(ctx, uts_ns->name.domainname, domainname_len);
	return ret;
}

You should take uts_sem.
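As a rough userspace analogue of that advice (a sketch, not the kernel
fix): copy both names into a private snapshot while holding a reader
lock, then compute lengths and write from the snapshot. In the kernel the
lock would be uts_sem, a read-write semaphore; here a pthread rwlock
plays its role, and struct fake_utsns, struct uts_snapshot and the two
helpers are hypothetical names made up for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <string.h>

#define NAME_LEN 65	/* room for a 64-byte name plus the NUL */

/* Hypothetical userspace stand-in for a shared, mutable UTS namespace. */
struct fake_utsns {
	pthread_rwlock_t lock;	/* plays the role of uts_sem */
	char nodename[NAME_LEN];
	char domainname[NAME_LEN];
};

/* Private copy that is safe to measure and write out without the lock. */
struct uts_snapshot {
	char nodename[NAME_LEN];
	char domainname[NAME_LEN];
};

/* Copy both names under the read lock so they form one consistent pair. */
static void uts_snapshot_take(struct fake_utsns *ns, struct uts_snapshot *snap)
{
	pthread_rwlock_rdlock(&ns->lock);
	memcpy(snap->nodename, ns->nodename, NAME_LEN);
	memcpy(snap->domainname, ns->domainname, NAME_LEN);
	pthread_rwlock_unlock(&ns->lock);
}

/* Writers (the sethostname() analogue) take the lock exclusively. */
static void uts_set_nodename(struct fake_utsns *ns, const char *name)
{
	pthread_rwlock_wrlock(&ns->lock);
	strncpy(ns->nodename, name, NAME_LEN - 1);
	ns->nodename[NAME_LEN - 1] = '\0';
	pthread_rwlock_unlock(&ns->lock);
}
```

With this shape, the length stored in the header and the string written
after it are both derived from the snapshot, so a concurrent writer
cannot make them disagree. It does not address the wider question of
whether tasks inside the checkpoint observed an older value.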


Example 2: ipcns is shared with the rest of the world

Consequently, the shm segment is visible outside and live. Someone has
already shmatted to it. What will end up in the shm segment content?
Anything.

You should check the struct file refcount or something and disable
attaching while dumping or something.


Moral: Every time you do dump on something live you get complications.
Every single time.


There are sockets and live netns as the most complex example. I'm not
prepared to describe it exactly, but people wishing to do C/R with
"leaks" should be very careful with their wishes.

2009-04-15 21:40:35

by Oren Laadan

Subject: Re: C/R without "leaks"



Alexey Dobriyan wrote:
>> Again, so to checkpoint one task in the topmost pid-ns you need to
>> checkpoint (if at all possible) the entire system ?!
>
> One more argument to not allow "leaks" and checkpoint whole container,
> no ifs, buts and woulditbenices.
>
> Just to clarify, C/R with "leak" is for example when process has separate
> pidns, but shares, for example, netns with other process not involved in
> checkpoint.
>
> If you allow this, you lose one important property of checkpoint part,
> namely, almost everything is frozen. Losing this property means suddenly
> much more stuff is alive during dump and you has to account to more stuff
> when checkpointing. You effectively checkpointing on live data structures
> and there is no guarantee you'll get it right.

Alexey, we're entirely on the same page about this: everyone agrees that
if you want the maximal guarantee (if one exists) you must checkpoint an
entire container and have no leaks.

The point I'm stressing is that there are other use cases, and other
users, that can do great things even without a full container. And my
goal is to provide them this capability. Especially since the mechanism
is shared by both cases.

>
> Example 1: utsns is shared with the rest of the world.
>
> utsns content is modifiable only by tasks (current->nsproxy->uts_ns).
> Consequently, someone can modify utsns content while you're dumping it
> if you allow "leaks".
>
> Did you take precautions? Where?
>
> static int cr_write_utsns(struct cr_ctx *ctx, struct uts_namespace *uts_ns)
> {
> struct cr_hdr h;
> struct cr_hdr_utsns *hh;
> int domainname_len;
> int nodename_len;
> int ret;
>
> h.type = CR_HDR_UTSNS;
> h.len = sizeof(*hh);
>
> hh = cr_hbuf_get(ctx, sizeof(*hh));
> if (!hh)
> return -ENOMEM;
>
> nodename_len = strlen(uts_ns->name.nodename) + 1;
> domainname_len = strlen(uts_ns->name.domainname) + 1;
>
> hh->nodename_len = nodename_len;
> hh->domainname_len = domainname_len;
>
> ret = cr_write_obj(ctx, &h, hh);
> cr_hbuf_put(ctx, sizeof(*hh));
> if (ret < 0)
> return ret;
>
> ret = cr_write_string(ctx, uts_ns->name.nodename, nodename_len);
> if (ret < 0)
> return ret;
>
> ret = cr_write_string(ctx, uts_ns->name.domainname, domainname_len);
> return ret;
> }
>
> You should take uts_sem.

Fair enough. Will fix :)

However, even with a leak count you need the uts_sem, because if this
is shared by another task when you start the checkpoint, but not
shared by the time you do the leak check - then you missed it.

And then, even the semaphore won't work unless you keep it for the
entire duration of the checkpoint: if tasks A and B inside the
container both know something about the UTS contents, and task C
outside modified it before the checkpoint was taken, then, at least
potentially, we have an inconsistency that neither you nor I detect.

The best part, however, is that it is unlikely that either A or B
would ever *care* about that, especially in the case of UTS.

And that brings me to the moral: in so many cases the user will live
happily ever after even if the UTS changes 50 times during the
checkpoint. Because her tasks don't care about it.

Remember that "flexibility" argument in my first post to this thread:
the next step is that the user can say "cradvise(UTS, I_DONT_CARE)":
during checkpoint the kernel won't save it, during restart the kernel
won't restore it. Voila, so little effort to make people happy :)
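To make the idea concrete, here is a minimal sketch of what such advice
could look like as a per-checkpoint policy table. No cradvise()
interface exists in the posted code; every name below (the enums,
struct cr_policy, cradvise_set(), cr_should_handle()) is hypothetical
and only illustrates "ask the policy before saving or restoring a
resource class".

```c
#include <stdbool.h>

/* Hypothetical resource classes a task could give advice about. */
enum cr_resource {
	CR_RES_UTS,
	CR_RES_IPC,
	CR_RES_NET,
	CR_RES_MAX
};

/* Hypothetical advice values; CR_ADV_DEFAULT means "save and restore". */
enum cr_advice {
	CR_ADV_DEFAULT,
	CR_ADV_DONT_CARE
};

/* Per-checkpoint policy table; all zeroes means CR_ADV_DEFAULT. */
struct cr_policy {
	enum cr_advice advice[CR_RES_MAX];
};

/* Record advice for one resource class; returns -1 on a bad argument. */
static int cradvise_set(struct cr_policy *p, int res, int adv)
{
	if (res < 0 || res >= CR_RES_MAX)
		return -1;
	if (adv != CR_ADV_DEFAULT && adv != CR_ADV_DONT_CARE)
		return -1;
	p->advice[res] = (enum cr_advice)adv;
	return 0;
}

/* Checkpoint and restart would both ask this before touching 'res'. */
static bool cr_should_handle(const struct cr_policy *p, int res)
{
	if (res < 0 || res >= CR_RES_MAX)
		return false;
	return p->advice[res] != CR_ADV_DONT_CARE;
}
```

After cradvise_set(&p, CR_RES_UTS, CR_ADV_DONT_CARE), the check returns
false for UTS and still true for IPC and NET, so only the resource the
user disclaimed is skipped.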

>
>
> Example 2: ipcns is shared with the rest of the world
>
> Consequently, shm segment is visible outside and live. Someone already
> shmatted to it. What will end up in shm segment content? Anything.

This is another excellent example. You are _so_ right that it doesn't
make much sense to try to restart a program that relies on something
that isn't part of the checkpoint.

And yet, there are a handful of programs, applications, processes that
do not depend on the outside world in any important way, tasks that
frankly, my dear, don't give a ...

>
> You should check struct file refcount or something and disable attaching
> while dumping or something.

Yes, yes, yes !

But -- when you focus solely on the full-container-only case.

Deciding what's best for the users is a two-edged-sword. It works
well to achieve foolproof operation with the less knowledgeable,
but it's a bit of an arrogant approach for the more sophisticated
ones.

If you limit c/r to a full-container-only, you take away a freedom
from the users - you take away a huge opportunity to use the c/r
to its full potential. And you have this extra functionality for
nearly free ! It's like giving the user a full blown linux laptop
but disallowing use of the command line :p

>
> Moral: Every time you do dump on something live you get complications.
> Every single time.

"while(1);" will never have complications... :)

And seriously, yes, you can bring endless examples of when it won't
work. And others will bring their examples of when it will be ok
even with "complications", because if you don't care about certain
stuff, the "complication" becomes void.

We can always restrict c/r later, either by code, or privileges, or
system config, sysadmin policy, flag to checkpoint(2), you name it.
So those who seek a general-case guarantee are happy. Why do it a priori
and block all other users ? Is it in everyone's best interest to
decide now that no-one should ever do so ?

Oren.

>
>
> There are sockets and live netns as the most complex example. I'm not
> prepared to describe it exactly, but people wishing to do C/R with
> "leaks" should be very careful with their wishes.
>

2009-04-15 22:43:19

by Greg Kurz

Subject: Re: C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel)

On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote:
> > Again, so to checkpoint one task in the topmost pid-ns you need to
> > checkpoint (if at all possible) the entire system ?!
>
> One more argument to not allow "leaks" and checkpoint whole container,
> no ifs, buts and woulditbenices.
>
> Just to clarify, C/R with "leak" is for example when process has separate
> pidns, but shares, for example, netns with other process not involved in
> checkpoint.
>
> If you allow this, you lose one important property of checkpoint part,
> namely, almost everything is frozen. Losing this property means suddenly
> much more stuff is alive during dump and you has to account to more stuff
> when checkpointing. You effectively checkpointing on live data structures
> and there is no guarantee you'll get it right.
>
> Example 1: utsns is shared with the rest of the world.
>
> utsns content is modifiable only by tasks (current->nsproxy->uts_ns).
> Consequently, someone can modify utsns content while you're dumping it
> if you allow "leaks".
>
> Did you take precautions? Where?
>
> static int cr_write_utsns(struct cr_ctx *ctx, struct uts_namespace *uts_ns)
> {
> struct cr_hdr h;
> struct cr_hdr_utsns *hh;
> int domainname_len;
> int nodename_len;
> int ret;
>
> h.type = CR_HDR_UTSNS;
> h.len = sizeof(*hh);
>
> hh = cr_hbuf_get(ctx, sizeof(*hh));
> if (!hh)
> return -ENOMEM;
>
> nodename_len = strlen(uts_ns->name.nodename) + 1;
> domainname_len = strlen(uts_ns->name.domainname) + 1;
>
> hh->nodename_len = nodename_len;
> hh->domainname_len = domainname_len;
>
> ret = cr_write_obj(ctx, &h, hh);
> cr_hbuf_put(ctx, sizeof(*hh));
> if (ret < 0)
> return ret;
>
> ret = cr_write_string(ctx, uts_ns->name.nodename, nodename_len);
> if (ret < 0)
> return ret;
>
> ret = cr_write_string(ctx, uts_ns->name.domainname, domainname_len);
> return ret;
> }
>
> You should take uts_sem.
>
>
> Example 2: ipcns is shared with the rest of the world
>
> Consequently, shm segment is visible outside and live. Someone already
> shmatted to it. What will end up in shm segment content? Anything.
>
> You should check struct file refcount or something and disable attaching
> while dumping or something.
>
>
> Moral: Every time you do dump on something live you get complications.
> Every single time.
>
>
> There are sockets and live netns as the most complex example. I'm not
> prepared to describe it exactly, but people wishing to do C/R with
> "leaks" should be very careful with their wishes.

They should close their sockets before checkpoint and find/have some way
to reconnect after. This implies some kind of C/R awareness in the code
to be checkpointed.

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2009-04-16 16:12:43

by Alexey Dobriyan

Subject: Re: C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel)

On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote:
> On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote:

> > There are sockets and live netns as the most complex example. I'm not
> > prepared to describe it exactly, but people wishing to do C/R with
> > "leaks" should be very careful with their wishes.
>
> They should close their sockets before checkpoint and find/have some way
> to reconnect after. This implies some kind of C/R awareness in the code
> to be checkpointed.

How do you imagine sshd closing sockets and reconnecting?

2009-04-16 18:12:28

by Chris Friesen

Subject: Re: C/R without "leaks"

Alexey Dobriyan wrote:
> On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote:
>> On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote:
>
>>> There are sockets and live netns as the most complex example. I'm not
>>> prepared to describe it exactly, but people wishing to do C/R with
>>> "leaks" should be very careful with their wishes.
>> They should close their sockets before checkpoint and find/have some way
>> to reconnect after. This implies some kind of C/R awareness in the code
>> to be checkpointed.
>
> How do you imagine sshd closing sockets and reconnecting?

Don't you already have to handle the case where an sshd connection is
checkpointed, then the system is shut down and the restore doesn't happen
until after the TCP timeout?

Chris

2009-04-16 18:41:28

by Oren Laadan

Subject: Re: C/R without "leaks"



Chris Friesen wrote:
> Alexey Dobriyan wrote:
>> On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote:
>>> On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote:
>>
>>>> There are sockets and live netns as the most complex example. I'm not
>>>> prepared to describe it exactly, but people wishing to do C/R with
>>>> "leaks" should be very careful with their wishes.
>>> They should close their sockets before checkpoint and find/have some way
>>> to reconnect after. This implies some kind of C/R awareness in the code
>>> to be checkpointed.
>>
>> How do you imagine sshd closing sockets and reconnecting?
>
> Don't you already have to handle the case where an sshd connection is
> checkpointed, then the system is shutdown and the restore doesn't happen
> until after the TCP timeout?

Any connection in that case is, of course, lost, and it's up to the
application to do something about it. If the application relies on
the state of the connection, it will have to give up (e.g. sshd, and
ssh, die).

However, there are many applications that can withstand a lost
connection without crashing. They simply retry (web browser, irc client,
db clients). With time, there may be more applications that are
'c/r-aware'.

Moreover, in some cases you could, on restart, use a wrapper to
create a new connection to somewhere (*), then ask restart(2) to
use that socket instead of the original, such that from the user
point of view things continue to work well, transparently.

(*) that somewhere could be the original peer, or another server,
if it has a way to somehow continue a cut connection, or a special
wrapper server that you write for that purpose.

Oren.

2009-04-17 08:46:35

by Greg Kurz

Subject: Re: C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel)

On Thu, 2009-04-16 at 20:12 +0400, Alexey Dobriyan wrote:
> On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote:
> > On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote:
>
> > > There are sockets and live netns as the most complex example. I'm not
> > > prepared to describe it exactly, but people wishing to do C/R with
> > > "leaks" should be very careful with their wishes.
> >
> > They should close their sockets before checkpoint and find/have some way
> > to reconnect after. This implies some kind of C/R awareness in the code
> > to be checkpointed.
>
> How do you imagine sshd closing sockets and reconnecting?

Dunno and it isn't really my concern... I'm interested in HPC jobs that
can collaborate with the C/R feature. For example, those jobs that use
interconnect hardware that will never be *checkpointable*... Usually,
the batch manager tells the job it's going to be checkpointed, so that
it can disconnect/shrink memory/reach a quiescent point, and reconnect
after resuming execution.
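A minimal sketch of that cooperation, assuming the batch manager
notifies the job with signals. The choice of SIGUSR1 for "about to be
checkpointed" and SIGUSR2 for "resumed" is an assumption for the
example, not something any batch manager is claimed to use, and struct
job with its 'connected' flag is a hypothetical stand-in for the real
interconnect state.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>

enum { CR_EV_NONE, CR_EV_PRE_CHECKPOINT, CR_EV_POST_RESTART };

/* Set from the signal handlers, polled by the job's main loop. */
static volatile sig_atomic_t cr_event = CR_EV_NONE;

static void on_pre_checkpoint(int sig)
{
	(void)sig;
	cr_event = CR_EV_PRE_CHECKPOINT;
}

static void on_post_restart(int sig)
{
	(void)sig;
	cr_event = CR_EV_POST_RESTART;
}

/* Hypothetical job state: whether the interconnect is attached. */
struct job {
	bool connected;
};

/* Install the handlers; SIGUSR1/SIGUSR2 are an assumed convention. */
static int job_cr_init(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sigemptyset(&sa.sa_mask);
	sa.sa_handler = on_pre_checkpoint;
	if (sigaction(SIGUSR1, &sa, NULL) < 0)
		return -1;
	sa.sa_handler = on_post_restart;
	return sigaction(SIGUSR2, &sa, NULL);
}

/*
 * Called from the main loop between work units. A real job would also
 * have to cope with a signal arriving between the read and the reset.
 */
static void job_cr_poll(struct job *j)
{
	int ev = cr_event;

	if (ev == CR_EV_NONE)
		return;
	cr_event = CR_EV_NONE;

	if (ev == CR_EV_PRE_CHECKPOINT)
		j->connected = false;	/* disconnect, reach a quiescent point */
	else if (ev == CR_EV_POST_RESTART)
		j->connected = true;	/* reconnect after resuming */
}
```

The handlers only record the event; the actual disconnect and reconnect
happen in the main loop, where it is safe to touch job state.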

I understand you aim at supporting transparent C/R of connected TCP
sockets. Nice feature. Could you give use cases where it's *really*
helpful/needed/mandatory ?


2009-04-17 09:16:15

by Greg Kurz

Subject: Re: C/R without "leaks"

On Thu, 2009-04-16 at 14:39 -0400, Oren Laadan wrote:
> Any connection in that case is, of course, lost, and it's up to the
> application to do something about it. If the application relies on
> the state of the connection, it will have to give up (e.g. sshd, and
> ssh, die).
>

And that's a good thing since that's exactly what users expect from
sshd: to give up the connection when something goes wrong. I wouldn't
trust an sshd with the ability to initiate connections on its own...

And anyway, I still don't see the scenario where C/R of an sshd is useful...
Please someone (Alexey ?), provide a detailed use case where people
would want to checkpoint or migrate live TCP connections... Discussion
on containers@ is very interesting but really lacks
what-is-the-bigger-picture arguments... These huge patchsets are very
tricky and intrusive... who wants them mainline ? what's the use of
C/R ?

> However, there are many application that can withstand connection
> lost without crashing. They simply retry (web browser, irc client,
> db clients). With time, there may be more applications that are
> 'c/r-aware'.
>

HPC jobs are definitely good candidates.

> Moreover, in some cases you could, on restart, use a wrapper to
> create a new connection to somewhere (*), then ask restart(2) to
> use that socket instead of the original, such that from the user
> point of view things continue to work well, transparently.
>

Yes.

> (*) that somewhere, could be the original peer, or another server,
> if it has a way to somehow continue a cut connection, or a special
> wrapper server that you right for that purpose.
>
> Oren.
>

2009-04-17 09:50:44

by Oren Laadan

Subject: Re: C/R without "leaks"



Greg Kurz wrote:
> On Thu, 2009-04-16 at 14:39 -0400, Oren Laadan wrote:
>> Any connection in that case is, of course, lost, and it's up to the
>> application to do something about it. If the application relies on
>> the state of the connection, it will have to give up (e.g. sshd, and
>> ssh, die).
>>
>
> And that's a good thing since that's exactly what users expect from
> sshd : to give up the connection when something goes wrong. I wouldn't
> trust a sshd with the ability to initiate connections on its own...
>
> And anyway, I still don't see the scenario where C/R a sshd is useful...

You probably mean an sshd with an open connection; being able to c/r
the server itself is clearly useful.

> Please someone (Alexey ?), provide a detailed use case where people
> would want to checkpoint or migrate live TCP connections... Discussion
> on containers@ is very interesting but really lacks of
> what-is-the-bigger-picture arguments... These huge patchsets are very
> tricky and intrusive... who wants them mainline ? what's the use of
> C/R ?
>

A canonical example would be a virtual private server: instead of doing
server consolidation with virtual machines, you do it with containers.
In a sense, containers let you chop the OS into independent isolated
pieces. You can use a linux box to run multiple virtual execution
environments (containers), each running services of your choice. They
could range from an sshd for users, to apache servers, to database
servers, to users' vnc sessions, etc.

Now comes the time when you really need to take the machine down, for
whatever reason. With c/r of live connections you can live-migrate
these containers to another machine (on the same subnet) that will
"steal" the IP as well, and voila - no service disruption.

Such scenarios are the focus of Alexey.

I'm also very interested in these scenarios, and I'm _also_ thinking
of other scenarios, where either (a) an entire container is not
necessary (example: a user running a long computation on a laptop who
wants to save it before a reboot), or (b) the program would like to make
adjustments to its state compared to the time it was saved (example:
change the location of an output log file depending on the machine
on which you are running).

Unfortunately, if we plan for and require, as per Alexey, that c/r
would only work for whole-containers, these two cases will not be
addressed.

Oren.

>> However, there are many application that can withstand connection
>> lost without crashing. They simply retry (web browser, irc client,
>> db clients). With time, there may be more applications that are
>> 'c/r-aware'.
>>
>
> HPC jobs are definitely good candidates.
>
>> Moreover, in some cases you could, on restart, use a wrapper to
>> create a new connection to somewhere (*), then ask restart(2) to
>> use that socket instead of the original, such that from the user
>> point of view things continue to work well, transparently.
>>
>
> Yes.
>
>> (*) that somewhere, could be the original peer, or another server,
>> if it has a way to somehow continue a cut connection, or a special
>> wrapper server that you right for that purpose.
>>
>> Oren.
>>

2009-04-17 12:35:37

by Greg Kurz

Subject: Re: C/R without "leaks"

On Fri, 2009-04-17 at 05:48 -0400, Oren Laadan wrote:
> You mean an sshd with an open connection probably; the server itself
> is clearly useful to be able to c/r.
>

Yes I mean C/R of sshd with active connections.

>
> A canonical example would a virtual-private-server: instead of doing
> server consolidation with a virtual machine, your do with containers.
> In a sense, containers lets you chop the OS into independent isolated
> pieces. You ca use a linux box to run multiple virtual execution
> environments (containers), each running services of your choice. They
> could range from a sshd for users, to apache servers, to database
> servers to users' vnc sessions, etc.
>

Indeed, containers allow implementing a VPS just like virtual machines
do: we call them system containers. Not much to say about that since
they don't introduce new concepts to users.

> Now comes the that you really need to take the machine down, for
> whatever reason. With c/r of live connections you can live-migrate
> these containers to another machine (on the same subnet) that will
> "steal" the IP as well, and voila - no service disruption.
>

Theoretically, yes. Practically, you need a lot more than *simply*
capturing and restoring socket states for such a migration to be usable
in the real world.

>
> Such scenarios are the focus of Alexey.
>

So Alexey should provide some realistic examples, with several hosts,
routers, switches and overall network infrastructure.

> I'm also very interested in these scenarios, and I'm _also_ thinking
> of other scenarios, where either (a) an entire container is not
> necessary (example: user running long computation on laptop and wants
> to save it before a reboot), or (b) the program would like to make
> adjustments to its state compared to the time it was saved (example:
> change the location of an output log file depending on the machine
> on which your are running).
>

I'm _only_ interested in these other scenarios for the moment.

> Unfortunately, if we plan for and require, as per Alexey, that c/r
> would only work for whole-containers, these two cases will not be
> addressed.
>

Discussion must go on then. There's no hurry in getting C/R
mainlined. :)


2009-04-22 00:16:29

by Nathan Lynch

Subject: Re: C/R without "leaks"

Oren Laadan writes:
> Alexey Dobriyan wrote:
>>> Again, so to checkpoint one task in the topmost pid-ns you need to
>>> checkpoint (if at all possible) the entire system ?!
>>
>> One more argument to not allow "leaks" and checkpoint whole container,
>> no ifs, buts and woulditbenices.
>>
>> Just to clarify, C/R with "leak" is for example when process has separate
>> pidns, but shares, for example, netns with other process not involved in
>> checkpoint.
>>
>> If you allow this, you lose one important property of checkpoint part,
>> namely, almost everything is frozen. Losing this property means suddenly
>> much more stuff is alive during dump and you has to account to more stuff
>> when checkpointing. You effectively checkpointing on live data structures
>> and there is no guarantee you'll get it right.
>
> Alexey, we're entirely on par about this: everyone agrees that if you
> want the maximal guarantee (if one exists) you must checkpoint entire
> container and have no leaks.
>
> The point I'm stressing is that there are other use cases, and other
> users, that can do great things even without full container. And my
> goal is to provide them this capability.

As it seems that Alexey's goal is more or less a subset of yours, would
it make sense in the near term to concentrate on getting an
implementation upstream that satisfies that subset (i.e. checkpoint on a
container basis only)? And then support for checkpointing arbitrary
processes could be added later, if it proves feasible?